[jira] [Updated] (SPARK-32623) Set the downloaded artifact location to explicitly glob test reports

2020-08-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32623:
-
Summary: Set the downloaded artifact location to explicitly glob test 
reports  (was: Set the downloaded artifact location to explicitly glob test 
reportss)

> Set the downloaded artifact location to explicitly glob test reports
> 
>
> Key: SPARK-32623
> URL: https://issues.apache.org/jira/browse/SPARK-32623
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After SPARK-32357, GitHub Actions now reports the test results. However, it 
> seems some tests are not reported correctly; I attached an image.
> It looks like running tests with tags to exclude/include them still generates 
> empty JUnit test reports. So, when tests are split by tags, an empty report 
> can overwrite the previous proper report.
> We can avoid this problem by explicitly setting the directory for the 
> downloaded artifacts.






[jira] [Created] (SPARK-32623) Set the downloaded artifact location to explicitly glob test reportss

2020-08-14 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32623:


 Summary: Set the downloaded artifact location to explicitly glob 
test reportss
 Key: SPARK-32623
 URL: https://issues.apache.org/jira/browse/SPARK-32623
 Project: Spark
  Issue Type: Sub-task
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


After SPARK-32357, GitHub Actions now reports the test results. However, it seems 
some tests are not reported correctly. I attached an image.

It looks like running tests with tags to exclude/include them still generates 
empty JUnit test reports. So, when tests are split by tags, an empty report can 
overwrite the previous proper report.

We can avoid this problem by explicitly setting the directory for the downloaded 
artifacts.






[jira] [Commented] (SPARK-32606) Remove the fork of action-download-artifact in test_report.yml

2020-08-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178162#comment-17178162
 ] 

Hyukjin Kwon commented on SPARK-32606:
--

I raised a PR for this: https://github.com/dawidd6/action-download-artifact/pull/24

> Remove the fork of action-download-artifact in test_report.yml
> --
>
> Key: SPARK-32606
> URL: https://issues.apache.org/jira/browse/SPARK-32606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> It was forked to carry a custom fix, 
> https://github.com/HyukjinKwon/action-download-artifact/commit/750b71af351aba467757d7be6924199bb08db4ed,
> in order to add support for downloading all artifacts. The fix should be 
> contributed back to the original plugin so we can avoid using the fork.
> Alternatively, we can use the official actions/download-artifact once it 
> supports downloading artifacts between different workflows; see also 
> https://github.com/actions/download-artifact/issues/3






[jira] [Commented] (SPARK-32605) Remove the fork of action-surefire-report in test_report.yml

2020-08-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178160#comment-17178160
 ] 

Hyukjin Kwon commented on SPARK-32605:
--

I raised a PR for this: https://github.com/ScaCap/action-surefire-report/pull/14

> Remove the fork of action-surefire-report in test_report.yml
> 
>
> Key: SPARK-32605
> URL: https://issues.apache.org/jira/browse/SPARK-32605
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> It was forked to carry a custom fix, 
> https://github.com/HyukjinKwon/action-surefire-report/commit/c96094cc35061fcf154a7cb46807f2f3e2339476,
> which adds support for a custom target commit SHA. The fix should be 
> contributed back to the original plugin so we can avoid using the fork.






[jira] [Commented] (SPARK-32622) Add case-sensitivity test for ORC predicate pushdown

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178157#comment-17178157
 ] 

Apache Spark commented on SPARK-32622:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/29427

> Add case-sensitivity test for ORC predicate pushdown 
> -
>
> Key: SPARK-32622
> URL: https://issues.apache.org/jira/browse/SPARK-32622
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> We should add a case-sensitivity test for ORC predicate pushdown to increase 
> its test coverage.






[jira] [Assigned] (SPARK-32622) Add case-sensitivity test for ORC predicate pushdown

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32622:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Add case-sensitivity test for ORC predicate pushdown 
> -
>
> Key: SPARK-32622
> URL: https://issues.apache.org/jira/browse/SPARK-32622
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> We should add a case-sensitivity test for ORC predicate pushdown to increase 
> its test coverage.






[jira] [Assigned] (SPARK-32622) Add case-sensitivity test for ORC predicate pushdown

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32622:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Add case-sensitivity test for ORC predicate pushdown 
> -
>
> Key: SPARK-32622
> URL: https://issues.apache.org/jira/browse/SPARK-32622
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> We should add a case-sensitivity test for ORC predicate pushdown to increase 
> its test coverage.






[jira] [Created] (SPARK-32622) Add case-sensitivity test for ORC predicate pushdown

2020-08-14 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-32622:
---

 Summary: Add case-sensitivity test for ORC predicate pushdown 
 Key: SPARK-32622
 URL: https://issues.apache.org/jira/browse/SPARK-32622
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


We should add a case-sensitivity test for ORC predicate pushdown to increase its 
test coverage.
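For illustration, here is a minimal, hypothetical sketch of the flavor of check such a test could perform, assuming a local SparkSession and a scratch directory; the names, data, and paths are invented and this is not the actual test added for this ticket:

{code:java}
import org.apache.spark.sql.SparkSession

// Simplified illustration: check that filtering an ORC file with a mixed-case
// column behaves consistently under spark.sql.caseSensitive.
object OrcPushdownCaseSensitivityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("orc-case").getOrCreate()
    import spark.implicits._

    val path = java.nio.file.Files.createTempDirectory("orc-case").toString
    // Write ORC data with a mixed-case column name.
    Seq((1, "a"), (2, "b")).toDF("ID", "name").write.mode("overwrite").orc(path)

    // Case-insensitive (the default): a filter on "id" resolves against "ID"
    // and the scan returns the matching row.
    spark.conf.set("spark.sql.caseSensitive", "false")
    assert(spark.read.orc(path).filter("id = 1").count() == 1)

    // Case-sensitive: "id" no longer resolves against "ID", so analysis fails.
    spark.conf.set("spark.sql.caseSensitive", "true")
    assert(scala.util.Try(spark.read.orc(path).filter("id = 1").count()).isFailure)

    spark.stop()
  }
}
{code}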






[jira] [Updated] (SPARK-32542) add a batch for optimizing logicalPlan

2020-08-14 Thread karl wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

karl wang updated SPARK-32542:
--
Description: 
Split an Expand into several smaller Expands, each containing the specified 
number of projections.
For instance, consider this SQL: select a, b, c, d, count(1) from table1 group 
by a, b, c, d with cube. It can expand the data to 2^4 times its original size.
If we set spark.sql.optimizer.projections.size=4, the Expand will be split into 
2^4/4 = 4 smaller Expands. This reduces shuffle pressure and improves 
performance for multidimensional analysis when the data is huge.


  was:
Split an Expand into several smaller Expands, each containing the specified 
number of projections.
For instance, consider this SQL: select a, b, c, d, count(1) from table1 group 
by a, b, c, d with cube. It can expand the data to 2^4 times its original size.
If we set spark.sql.optimizer.projections.size=4, the Expand will be split into 
2^4/4 = 4 smaller Expands. This reduces shuffle pressure and improves 
performance for multidimensional analysis when the data is huge.

runBenchmark("cube multianalysis agg") {
  val N = 20 << 20

  val benchmark = new Benchmark("cube multianalysis agg", N, output = 
output)

  def f(): Unit = {
val df = spark.range(N).cache()
df.selectExpr(
  "id",
  "(id & 1023) as k1",
  "cast(id & 1023 as string) as k2",
  "cast(id & 1023 as int) as k3",
  "cast(id & 1023 as double) as k4",
  "cast(id & 1023 as float) as k5")
  .cube("k1", "k2", "k3", "k4", "k5")
  .sum()
  .noop()
df.unpersist()

  }

  benchmark.addCase("grouping = F") { _ =>
withSQLConf(
  SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
  SQLConf.GROUPING_WITH_UNION.key -> "false") {
  f()
}
  }

  benchmark.addCase("grouping = T projectionSize= 16") { _ =>
withSQLConf(
  SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
  SQLConf.GROUPING_WITH_UNION.key -> "true",
  SQLConf.GROUPING_EXPAND_PROJECTIONS.key -> "16") {
  f()
}
  }

  benchmark.addCase("grouping = T projectionSize= 8") { _ =>
withSQLConf(
  SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
  SQLConf.GROUPING_WITH_UNION.key -> "true",
  SQLConf.GROUPING_EXPAND_PROJECTIONS.key -> "8") {
  f()
}
  }
  benchmark.run()
}
Running benchmark: cube multianalysis agg (cube 5 fields: k1, k2, k3, k4, k5)
Running case: GROUPING_WITH_UNION off
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 16
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 8

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15
Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz
cube multianalysis agg:            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
grouping = F                               54329          54931         852         0.4        2590.6       1.0X
grouping = T projectionSize= 16            44584          44781         278         0.5        2125.9       1.2X
grouping = T projectionSize= 8             42764          43272         718         0.5        2039.1       1.3X

Running benchmark: cube multianalysis agg (cube 6 fields: k1, k2, k3, k4, k5, k6)
Running case: GROUPING_WITH_UNION off
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 32
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 16

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15
Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz
cube multianalysis agg:            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
grouping = F                              141607         143424        2569         0.1        6752.4       1.0X
grouping = T projectionSize= 32           109465         109603         196         0.2        5219.7       1.3X
grouping = T projectionSize= 16            99752         100411         933         0.2        4756.5       1.4X

Running benchmark: cube multianalysis agg (cube 7 fields: k1, k2, k3, k4, k5, k6, k7)
Running case: GROUPING_WITH_UNION off
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 64
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 32
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 16

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15
Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz
cube multianalysis agg:

[jira] [Updated] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-14 Thread karl wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

karl wang updated SPARK-32542:
--
Summary: Add an optimizer rule to split an Expand into multiple Expands for 
aggregates  (was: add a batch for optimizing logicalPlan)

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Split an Expand into several smaller Expands, each containing the specified 
> number of projections.
> For instance, consider this SQL: select a, b, c, d, count(1) from table1 
> group by a, b, c, d with cube. It can expand the data to 2^4 times its 
> original size.
> If we set spark.sql.optimizer.projections.size=4, the Expand will be split 
> into 2^4/4 = 4 smaller Expands. This reduces shuffle pressure and improves 
> performance for multidimensional analysis when the data is huge.
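To make the numbers above concrete, here is a small illustrative sketch (not part of any patch). It only shows the query shape involved; note that spark.sql.optimizer.projections.size is the configuration proposed by this ticket and has no effect on stock Spark:

{code:java}
import org.apache.spark.sql.SparkSession

object CubeExpandIllustration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("cube-expand").getOrCreate()
    import spark.implicits._

    Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d").createOrReplaceTempView("table1")

    // GROUP BY a, b, c, d WITH CUBE produces 2^4 = 16 grouping sets, i.e. the plan
    // contains a single Expand with 16 projections that multiplies the input row
    // count by 16 before the shuffle.
    val cubed = spark.sql(
      "SELECT a, b, c, d, count(1) FROM table1 GROUP BY a, b, c, d WITH CUBE")
    cubed.explain() // the physical plan shows one Expand feeding the aggregate

    // Under the proposal, with the (hypothetical) spark.sql.optimizer.projections.size=4,
    // that single 16-projection Expand would be split into 16/4 = 4 smaller Expands
    // whose outputs are unioned, so each shuffle input stays smaller.
    spark.stop()
  }
}
{code}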






[jira] [Updated] (SPARK-32542) add a batch for optimizing logicalPlan

2020-08-14 Thread karl wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

karl wang updated SPARK-32542:
--
Description: 
Split an Expand into several smaller Expands, each containing the specified 
number of projections.
For instance, consider this SQL: select a, b, c, d, count(1) from table1 group 
by a, b, c, d with cube. It can expand the data to 2^4 times its original size.
If we set spark.sql.optimizer.projections.size=4, the Expand will be split into 
2^4/4 = 4 smaller Expands. This reduces shuffle pressure and improves 
performance for multidimensional analysis when the data is huge.

runBenchmark("cube multianalysis agg") {
  val N = 20 << 20

  val benchmark = new Benchmark("cube multianalysis agg", N, output = 
output)

  def f(): Unit = {
val df = spark.range(N).cache()
df.selectExpr(
  "id",
  "(id & 1023) as k1",
  "cast(id & 1023 as string) as k2",
  "cast(id & 1023 as int) as k3",
  "cast(id & 1023 as double) as k4",
  "cast(id & 1023 as float) as k5")
  .cube("k1", "k2", "k3", "k4", "k5")
  .sum()
  .noop()
df.unpersist()

  }

  benchmark.addCase("grouping = F") { _ =>
withSQLConf(
  SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
  SQLConf.GROUPING_WITH_UNION.key -> "false") {
  f()
}
  }

  benchmark.addCase("grouping = T projectionSize= 16") { _ =>
withSQLConf(
  SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
  SQLConf.GROUPING_WITH_UNION.key -> "true",
  SQLConf.GROUPING_EXPAND_PROJECTIONS.key -> "16") {
  f()
}
  }

  benchmark.addCase("grouping = T projectionSize= 8") { _ =>
withSQLConf(
  SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true",
  SQLConf.GROUPING_WITH_UNION.key -> "true",
  SQLConf.GROUPING_EXPAND_PROJECTIONS.key -> "8") {
  f()
}
  }
  benchmark.run()
}
Running benchmark: cube multianalysis agg (cube 5 fields: k1, k2, k3, k4, k5)
Running case: GROUPING_WITH_UNION off
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 16
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 8

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15
Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz
cube multianalysis agg:            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
grouping = F                               54329          54931         852         0.4        2590.6       1.0X
grouping = T projectionSize= 16            44584          44781         278         0.5        2125.9       1.2X
grouping = T projectionSize= 8             42764          43272         718         0.5        2039.1       1.3X

Running benchmark: cube multianalysis agg (cube 6 fields: k1, k2, k3, k4, k5, k6)
Running case: GROUPING_WITH_UNION off
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 32
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 16

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15
Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz
cube multianalysis agg:            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
grouping = F                              141607         143424        2569         0.1        6752.4       1.0X
grouping = T projectionSize= 32           109465         109603         196         0.2        5219.7       1.3X
grouping = T projectionSize= 16            99752         100411         933         0.2        4756.5       1.4X

Running benchmark: cube multianalysis agg (cube 7 fields: k1, k2, k3, k4, k5, k6, k7)
Running case: GROUPING_WITH_UNION off
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 64
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 32
Running case: GROUPING_WITH_UNION on and GROUPING_EXPAND_PROJECTIONS = 16

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15
Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz
cube multianalysis agg:            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
grouping = F                              516941         519658         NaN         0.0       24649.7       1.0X
grouping = T projectionSize= 64           267170         267547         533         0.1       12739.6       1.9X
grouping = T 

[jira] [Updated] (SPARK-32542) add a batch for optimizing logicalPlan

2020-08-14 Thread karl wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

karl wang updated SPARK-32542:
--
Priority: Major  (was: Minor)

> add a batch for optimizing logicalPlan
> --
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Split an Expand into several smaller Expands, each containing the specified 
> number of projections.
> For instance, consider this SQL: select a, b, c, d, count(1) from table1 
> group by a, b, c, d with cube. It can expand the data to 2^4 times its 
> original size.
> If we set spark.sql.optimizer.projections.size=4, the Expand will be split 
> into 2^4/4 = 4 smaller Expands. This reduces shuffle pressure and improves 
> performance for multidimensional analysis when the data is huge.






[jira] [Commented] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again

2020-08-14 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178148#comment-17178148
 ] 

Rohit Mishra commented on SPARK-32613:
--

[~dagrawal3409], please avoid populating the Fix Version/s field; it is reserved 
for committers.

> DecommissionWorkerSuite has started failing sporadically again
> --
>
> Key: SPARK-32613
> URL: https://issues.apache.org/jira/browse/SPARK-32613
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Devesh Agrawal
>Priority: Major
>
> Test "decommission workers ensure that fetch failures lead to rerun" is 
> failing: 
>  
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/]
> https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579
>  






[jira] [Updated] (SPARK-32613) DecommissionWorkerSuite has started failing sporadically again

2020-08-14 Thread Rohit Mishra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Mishra updated SPARK-32613:
-
Fix Version/s: (was: 3.1.0)

> DecommissionWorkerSuite has started failing sporadically again
> --
>
> Key: SPARK-32613
> URL: https://issues.apache.org/jira/browse/SPARK-32613
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Devesh Agrawal
>Priority: Major
>
> Test "decommission workers ensure that fetch failures lead to rerun" is 
> failing: 
>  
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127357/testReport/org.apache.spark.deploy/DecommissionWorkerSuite/decommission_workers_ensure_that_fetch_failures_lead_to_rerun/]
> https://github.com/apache/spark/pull/29367/checks?check_run_id=972990200#step:14:13579
>  






[jira] [Commented] (SPARK-32619) converting dataframe to dataset for the json schema

2020-08-14 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178147#comment-17178147
 ] 

Rohit Mishra commented on SPARK-32619:
--

[~Manjay7869], can you please add the environment details and reproducible steps? 
Please also rewrite the description, introducing code blocks wherever you explain 
a code snippet.

If you are not sure whether this is a bug, please try Stack Overflow or the user 
mailing list to get an answer and probably a solution. Please read the guidelines 
for future reference: [https://spark.apache.org/community.html]

> converting dataframe to dataset for the json schema
> ---
>
> Key: SPARK-32619
> URL: https://issues.apache.org/jira/browse/SPARK-32619
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Manjay Kumar
>Priority: Minor
>
> I have JSON data with a schema like:
>  
> {
>   Details: [{
>     phone: "98977999",
>     contacts: [{
>       name: "manjay"
>       -- the "street" field is missing in the JSON
>     }]
>   }]
> }
>  
> Case classes based on the schema:
>  
> case class Details(
>   phone: String,
>   contacts: Array[Adress]
> )
>  
> case class Adress(
>   name: String,
>   street: String
> )
>  
> Calling dataframe.as[Details] throws: No such struct field street - 
> AnalysisException.
>  
> Is this a bug, or is there a resolution for this?
>  
>  
>  
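One possible workaround, sketched below under the assumption that the data comes in as JSON text; the Root wrapper class and the inline sample string are hypothetical additions, not from the original report. The idea is to pass an explicit schema derived from the case classes, so the missing street field is read as null and the .as[...] conversion can resolve every field:

{code:java}
import org.apache.spark.sql.{Encoders, SparkSession}

// Case classes mirroring the ones in the description; "Root" is an added wrapper
// for the top-level "Details" array and is not part of the original report.
case class Adress(name: String, street: String)
case class Details(phone: String, contacts: Array[Adress])
case class Root(Details: Array[Details])

object MissingFieldWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("missing-field").getOrCreate()
    import spark.implicits._

    // Inline sample standing in for the real JSON source; "street" is absent.
    val json = Seq("""{"Details":[{"phone":"98977999","contacts":[{"name":"manjay"}]}]}""")

    val raw = spark.read
      .schema(Encoders.product[Root].schema) // explicit schema includes "street"
      .json(spark.createDataset(json))

    // With "street" present in the schema (read as null), the conversion no
    // longer fails with "No such struct field street".
    val ds = raw.as[Root]
    ds.show(truncate = false)

    spark.stop()
  }
}
{code}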






[jira] [Assigned] (SPARK-32621) "path" option is added again to input paths during infer()

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32621:


Assignee: Apache Spark

> "path" option is added again to input paths during infer()
> --
>
> Key: SPARK-32621
> URL: https://issues.apache.org/jira/browse/SPARK-32621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> When "path" option is used when creating a DataFrame, it can cause issues 
> during infer.
> {code:java}
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> val path = "/tmp"
> val df = spark.range(2)
> df.write.json(path + "/p=1")
> df.write.json(path + "/p=2")
> val extraOptions = Map(
>   "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
>   "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
> )
> // This works fine.
> assert(spark.read.options(extraOptions).json(path).count == 2)
> // The following with "path" option fails with the following:
> // assertion failed: Conflicting directory structures detected. Suspicious 
> paths
> //file:/tmp
> //file:/tmp/p=1
> assert(spark.read.options(extraOptions).format("json").option("path", 
> path).load.count() === 2)
> {code}






[jira] [Commented] (SPARK-32621) "path" option is added again to input paths during infer()

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178125#comment-17178125
 ] 

Apache Spark commented on SPARK-32621:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/29437

> "path" option is added again to input paths during infer()
> --
>
> Key: SPARK-32621
> URL: https://issues.apache.org/jira/browse/SPARK-32621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> When "path" option is used when creating a DataFrame, it can cause issues 
> during infer.
> {code:java}
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> val path = "/tmp"
> val df = spark.range(2)
> df.write.json(path + "/p=1")
> df.write.json(path + "/p=2")
> val extraOptions = Map(
>   "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
>   "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
> )
> // This works fine.
> assert(spark.read.options(extraOptions).json(path).count == 2)
> // The following with "path" option fails with the following:
> // assertion failed: Conflicting directory structures detected. Suspicious 
> paths
> //file:/tmp
> //file:/tmp/p=1
> assert(spark.read.options(extraOptions).format("json").option("path", 
> path).load.count() === 2)
> {code}






[jira] [Assigned] (SPARK-32621) "path" option is added again to input paths during infer()

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32621:


Assignee: (was: Apache Spark)

> "path" option is added again to input paths during infer()
> --
>
> Key: SPARK-32621
> URL: https://issues.apache.org/jira/browse/SPARK-32621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> When "path" option is used when creating a DataFrame, it can cause issues 
> during infer.
> {code:java}
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> val path = "/tmp"
> val df = spark.range(2)
> df.write.json(path + "/p=1")
> df.write.json(path + "/p=2")
> val extraOptions = Map(
>   "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
>   "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
> )
> // This works fine.
> assert(spark.read.options(extraOptions).json(path).count == 2)
> // The following with "path" option fails with the following:
> // assertion failed: Conflicting directory structures detected. Suspicious 
> paths
> //file:/tmp
> //file:/tmp/p=1
> assert(spark.read.options(extraOptions).format("json").option("path", 
> path).load.count() === 2)
> {code}






[jira] [Updated] (SPARK-32621) "path" option is added again to input paths during infer()

2020-08-14 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-32621:
--
Priority: Minor  (was: Major)

> "path" option is added again to input paths during infer()
> --
>
> Key: SPARK-32621
> URL: https://issues.apache.org/jira/browse/SPARK-32621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0, 3.0.1, 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> When "path" option is used when creating a DataFrame, it can cause issues 
> during infer.
> {code:java}
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> val path = "/tmp"
> val df = spark.range(2)
> df.write.json(path + "/p=1")
> df.write.json(path + "/p=2")
> val extraOptions = Map(
>   "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
>   "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
> )
> // This works fine.
> assert(spark.read.options(extraOptions).json(path).count == 2)
> // The following with "path" option fails with the following:
> // assertion failed: Conflicting directory structures detected. Suspicious 
> paths
> //file:/tmp
> //file:/tmp/p=1
> assert(spark.read.options(extraOptions).format("json").option("path", 
> path).load.count() === 2)
> {code}






[jira] [Updated] (SPARK-32620) Reset the numPartitions metric when DPP is enabled

2020-08-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32620:

Summary: Reset the numPartitions metric when DPP is enabled  (was: Fix 
number of partitions read metric when DPP enabled)

> Reset the numPartitions metric when DPP is enabled
> --
>
> Key: SPARK-32620
> URL: https://issues.apache.org/jira/browse/SPARK-32620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-32621) "path" option is added again to input paths during infer()

2020-08-14 Thread Terry Kim (Jira)
Terry Kim created SPARK-32621:
-

 Summary: "path" option is added again to input paths during infer()
 Key: SPARK-32621
 URL: https://issues.apache.org/jira/browse/SPARK-32621
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 2.4.6, 3.0.1, 3.1.0
Reporter: Terry Kim


When "path" option is used when creating a DataFrame, it can cause issues 
during infer.
{code:java}
class TestFileFilter extends PathFilter {
  override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
}

val path = "/tmp"
val df = spark.range(2)
df.write.json(path + "/p=1")
df.write.json(path + "/p=2")

val extraOptions = Map(
  "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
  "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
)

// This works fine.
assert(spark.read.options(extraOptions).json(path).count == 2)

// The following with "path" option fails with the following:
// assertion failed: Conflicting directory structures detected. Suspicious paths
//  file:/tmp
//  file:/tmp/p=1
assert(spark.read.options(extraOptions).format("json").option("path", 
path).load.count() === 2)
{code}






[jira] [Assigned] (SPARK-32620) Fix number of partitions read metric when DPP enabled

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32620:


Assignee: (was: Apache Spark)

> Fix number of partitions read metric when DPP enabled
> -
>
> Key: SPARK-32620
> URL: https://issues.apache.org/jira/browse/SPARK-32620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-32620) Fix number of partitions read metric when DPP enabled

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32620:


Assignee: Apache Spark

> Fix number of partitions read metric when DPP enabled
> -
>
> Key: SPARK-32620
> URL: https://issues.apache.org/jira/browse/SPARK-32620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-32620) Fix number of partitions read metric when DPP enabled

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178118#comment-17178118
 ] 

Apache Spark commented on SPARK-32620:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29436

> Fix number of partitions read metric when DPP enabled
> -
>
> Key: SPARK-32620
> URL: https://issues.apache.org/jira/browse/SPARK-32620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Updated] (SPARK-32620) Fix number of partitions read metric when DPP enabled

2020-08-14 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32620:

Summary: Fix number of partitions read metric when DPP enabled  (was: Fix 
number of partitions read when DPP enabled)

> Fix number of partitions read metric when DPP enabled
> -
>
> Key: SPARK-32620
> URL: https://issues.apache.org/jira/browse/SPARK-32620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-32620) Fix number of partitions read when DPP enabled

2020-08-14 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-32620:
---

 Summary: Fix number of partitions read when DPP enabled
 Key: SPARK-32620
 URL: https://issues.apache.org/jira/browse/SPARK-32620
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang









[jira] [Commented] (SPARK-32092) CrossvalidatorModel does not save all submodels (it saves only 3)

2020-08-14 Thread Zirui Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178101#comment-17178101
 ] 

Zirui Xu commented on SPARK-32092:
--

I think CrossValidatorModel.copy() is affected by a similar issue of losing 
numFolds too. I will attempt a fix.

> CrossvalidatorModel does not save all submodels (it saves only 3)
> -
>
> Key: SPARK-32092
> URL: https://issues.apache.org/jira/browse/SPARK-32092
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 2.4.5
> Environment: Ran on two systems:
>  * Local pyspark installation (Windows): spark 2.4.5
>  * Spark 2.4.0 on a cluster
>Reporter: An De Rijdt
>Priority: Major
>
> When saving a CrossValidatorModel with more than 3 subModels and loading it 
> again, a different number of subModels is returned: it seems 3 subModels are 
> returned every time.
> With fewer subModels than the default of 3 (so 2 folds), writing plainly fails.
> The issue seems to be (but I am not so familiar with the Scala/Java side):
>  * the Python object is converted to a Scala/Java one
>  * in Scala we save subModels up to numFolds:
>  
> {code:java}
> val subModelsPath = new Path(path, "subModels")
> for (splitIndex <- 0 until instance.getNumFolds) {
>   val splitPath = new Path(subModelsPath, s"fold${splitIndex.toString}")
>   for (paramIndex <- 0 until instance.getEstimatorParamMaps.length) {
>     val modelPath = new Path(splitPath, paramIndex.toString).toString
>     instance.subModels(splitIndex)(paramIndex).asInstanceOf[MLWritable].save(modelPath)
>   }
> }
> {code}
>  * numFolds is not available on the CrossValidatorModel in PySpark
>  * the default numFolds is 3, so it ends up trying to save 3 subModels.
> The first issue can be reproduced by the following failing test, where spark 
> is a SparkSession and tmp_path is a (temporary) directory.
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> 
> def test_save_load_cross_validator(spark, tmp_path):
>     temp_path = str(tmp_path)
>     dataset = spark.createDataFrame(
>         [
>             (Vectors.dense([0.0]), 0.0),
>             (Vectors.dense([0.4]), 1.0),
>             (Vectors.dense([0.5]), 0.0),
>             (Vectors.dense([0.6]), 1.0),
>             (Vectors.dense([1.0]), 1.0),
>         ]
>         * 10,
>         ["features", "label"],
>     )
>     lr = LogisticRegression()
>     grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>     evaluator = BinaryClassificationEvaluator()
>     cv = CrossValidator(
>         estimator=lr,
>         estimatorParamMaps=grid,
>         evaluator=evaluator,
>         collectSubModels=True,
>         numFolds=4,
>     )
>     cvModel = cv.fit(dataset)
>     # test save/load of CrossValidatorModel
>     cvModelPath = temp_path + "/cvModel"
>     cvModel.write().overwrite().save(cvModelPath)
>     loadedModel = CrossValidatorModel.load(cvModelPath)
>     assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
> The second issue can be reproduced as follows (the write will fail):
> {code:java}
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.linalg import Vectors
> 
> def test_save_load_cross_validator(spark, tmp_path):
>     temp_path = str(tmp_path)
>     dataset = spark.createDataFrame(
>         [
>             (Vectors.dense([0.0]), 0.0),
>             (Vectors.dense([0.4]), 1.0),
>             (Vectors.dense([0.5]), 0.0),
>             (Vectors.dense([0.6]), 1.0),
>             (Vectors.dense([1.0]), 1.0),
>         ]
>         * 10,
>         ["features", "label"],
>     )
>     lr = LogisticRegression()
>     grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
>     evaluator = BinaryClassificationEvaluator()
>     cv = CrossValidator(
>         estimator=lr,
>         estimatorParamMaps=grid,
>         evaluator=evaluator,
>         collectSubModels=True,
>         numFolds=2,
>     )
>     cvModel = cv.fit(dataset)
>     # test save/load of CrossValidatorModel
>     cvModelPath = temp_path + "/cvModel"
>     cvModel.write().overwrite().save(cvModelPath)
>     loadedModel = CrossValidatorModel.load(cvModelPath)
>     assert len(loadedModel.subModels) == len(cvModel.subModels)
> {code}
>  
>  




[jira] [Resolved] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-14 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-32119.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28939
[https://github.com/apache/spark/pull/28939]

> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
> --
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes when a jar 
> that contains the plugins, and the files used by the plugins, are added via 
> the --jars and --files options of spark-submit.
> This is because jars and files added by --jars and --files are not yet loaded 
> when the Executor is initialized.
> I confirmed it works with YARN because there the jars/files are distributed 
> via the distributed cache.






[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178052#comment-17178052
 ] 

Rafael edited comment on SPARK-25390 at 8/14/20, 8:45 PM:
--

Hey guys [~cloud_fan] [~b...@cloudera.com],

I'm trying to migrate a package that uses *import 
org.apache.spark.sql.sources.v2._* to Spark 3.0.0,

and I haven't found a good guide, so may I ask my questions here?

Here is my migration plan; can you highlight which interfaces I should use now?

1. I cannot find what to use instead of ReadSupport and DataSourceReader.

If we now have to use Scan instead of ReadSupport, what happened to the 
createReader method? 
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

I haven't found what to use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
If the replacement is the InputPartition from the read package, it has a totally 
different contract from what my current code uses:
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}
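For reference, here is a minimal, hypothetical sketch (not an official migration guide) of how the pieces asked about above map onto the Spark 3.0 org.apache.spark.sql.connector API. All the Generating* names mirror the placeholders in the comment, and the tiny schema and row generator exist only to keep the example self-contained:

{code:java}
import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Old: class DefaultSource extends ReadSupport { def createReader(...): DataSourceReader }
// New: the entry point is a TableProvider returning a Table; the reader logic moves to Scan/Batch.
class DefaultSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    StructType(Seq(StructField("value", IntegerType)))

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new GeneratingTable(schema)
}

class GeneratingTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "generating_table"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] = util.EnumSet.of(TableCapability.BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = new ScanBuilder {
    override def build(): Scan = new GeneratingScan(tableSchema)
  }
}

// Old: DataSourceReader.planInputPartitions() returned InputPartition[InternalRow]s that
// created their own readers. New: the Scan/Batch plans plain InputPartition descriptors,
// and a separate PartitionReaderFactory creates the PartitionReaders.
class GeneratingScan(tableSchema: StructType) extends Scan with Batch {
  override def readSchema(): StructType = tableSchema
  override def toBatch: Batch = this
  override def planInputPartitions(): Array[InputPartition] = Array(GeneratingInputPartition(0, 5))
  override def createReaderFactory(): PartitionReaderFactory = new GeneratingReaderFactory
}

case class GeneratingInputPartition(start: Int, end: Int) extends InputPartition

class GeneratingReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val p = partition.asInstanceOf[GeneratingInputPartition]
    new PartitionReader[InternalRow] {
      private var current = p.start - 1
      override def next(): Boolean = { current += 1; current < p.end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
  }
}
{code}

As far as I can tell, the imports in item 2 above are right; in this shape, outputPartitioning() would live on the Scan by additionally mixing in org.apache.spark.sql.connector.read.SupportsReportPartitioning.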


was (Author: kyrdan):
Hey guys [~cloud_fan] [~b...@cloudera.com]

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

Here my migration plan can you highlight what interfaces should I use now

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
Interface like it has totally different contract
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>   

[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178052#comment-17178052
 ] 

Rafael edited comment on SPARK-25390 at 8/14/20, 8:44 PM:
--

Hey guys [~cloud_fan] [~b...@cloudera.com]

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

Here my migration plan can you highlight what interfaces should I use now

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
Interface like it has totally different contract
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}


was (Author: kyrdan):
Hey guys, 

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

Here my migration plan can you highlight what interfaces should I use now

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
Interface like it has totally different contract
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>

[jira] [Commented] (SPARK-25390) Data source V2 API refactoring

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178052#comment-17178052
 ] 

Rafael commented on SPARK-25390:


Hey guys, 

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

 

Here my migration plan can you highlight what interfaces should I use now

 

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}

class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of

 
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
 

I should use

 
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
 

right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
Interface like it has totally different contract
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently it's not very clear how we should abstract data source v2 API. The 
> abstraction should be unified between batch and streaming, or similar but 
> have a well-defined difference between batch and streaming. And the 
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to the abstraction






[jira] [Comment Edited] (SPARK-25390) Data source V2 API refactoring

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178052#comment-17178052
 ] 

Rafael edited comment on SPARK-25390 at 8/14/20, 8:42 PM:
--

Hey guys, 

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

Here my migration plan can you highlight what interfaces should I use now

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}
class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
I should use
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
Interface like it has totally different contract
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}
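For what it's worth, here is a rough, untested sketch of how the whole read path could look against the Spark 3.0 connector API (org.apache.spark.sql.connector); the class names are illustrative and only plain batch reads are covered. The general mapping is roughly ReadSupport -> TableProvider + Table, DataSourceReader -> ScanBuilder / Scan / Batch (with SupportsReportPartitioning now mixed into the Scan), and InputPartition / InputPartitionReader -> InputPartition + PartitionReaderFactory / PartitionReader.
{code:java}
import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.connector.read.partitioning.Partitioning
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Was: class DefaultSource extends ReadSupport { def createReader(...) }
class DefaultSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???   // roughly the old readSchema()
  override def getTable(schema: StructType, partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new GeneratingTable(schema)
}

// The Table declares BATCH_READ and hands out a ScanBuilder (roughly the old createReader()).
class GeneratingTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "generating_table"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] = util.EnumSet.of(TableCapability.BATCH_READ)
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = new ScanBuilder {
    override def build(): Scan = new GeneratingScan(tableSchema)
  }
}

// Was: DataSourceReader with SupportsReportPartitioning.
class GeneratingScan(tableSchema: StructType) extends Scan with Batch with SupportsReportPartitioning {
  override def readSchema(): StructType = tableSchema
  override def toBatch(): Batch = this
  override def planInputPartitions(): Array[InputPartition] = Array(new GeneratingInputPartition())
  override def createReaderFactory(): PartitionReaderFactory = new GeneratingReaderFactory()
  override def outputPartitioning(): Partitioning = ???
}

// Was: InputPartition.createPartitionReader + InputPartitionReader.
class GeneratingInputPartition() extends InputPartition   // now just a serializable split descriptor

class GeneratingReaderFactory() extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new GeneratingPartitionReader()
}

class GeneratingPartitionReader() extends PartitionReader[InternalRow] {
  override def next(): Boolean = ???
  override def get(): InternalRow = ???
  override def close(): Unit = ()
}
{code}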


was (Author: kyrdan):
Hey guys, 

I'm trying to migrate the package where I was using *import 
org.apache.spark.sql.sources.v2._ into Spark3.0.0*

and haven't found a good guide so may I ask my questions here.

 

Here my migration plan can you highlight what interfaces should I use now

 

1. Cannot find what should I use instead of ReadSupport, ReadSupport, 
DataSourceReader,

if instead of ReadSupport we have to use now Scan then what happened to method 
createReader? 
{code:java}

class DefaultSource extends ReadSupport { 
  override def createReader(options: DataSourceOptions): DataSourceReader = new 
GeneratingReader() 
}
{code}
2. 

Here instead of

 
{code:java}
import org.apache.spark.sql.sources.v2.reader.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.sources.v2.reader.{InputPartition, 
SupportsReportPartitioning}
{code}
 

I should use

 
{code:java}
import org.apache.spark.sql.connector.read.partitioning.{Distribution, 
Partitioning}
import org.apache.spark.sql.connector.read.{InputPartition, 
SupportsReportPartitioning}
{code}
 

right?
{code:java}
class GeneratingReader() extends DataSourceReader {
  override def readSchema(): StructType = {...}
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] = {
val partitions = new util.ArrayList[InputPartition[InternalRow]]()
...
partitions.add(new GeneratingInputPartition(...))
  }
  override def outputPartitioning(): Partitioning = {...}
}
{code}
3.

Haven't found what should I use instead of
{code:java}
import org.apache.spark.sql.sources.v2.reader.InputPartition
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader{code}
Interface like it has totally different contract
{code:java}
import org.apache.spark.sql.connector.read.InputPartition{code}
{code:java}
class GeneratingInputPartition() extends InputPartition[InternalRow] {
  override def createPartitionReader(): InputPartitionReader[InternalRow] = new 
 GeneratingInputPartitionReader(...)
}

class GeneratingInputPartitionReader() extends 
InputPartitionReader[InternalRow] {
   override def next(): Boolean = ...
   override def get(): InternalRow = ...
   override def close(): Unit = ...
}
{code}

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: 

[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2020-08-14 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178042#comment-17178042
 ] 

Sean R. Owen commented on SPARK-26132:
--

Yes exactly. Unfortunately you have to do that. What that bought us was better 
backwards-compatibility in Spark 3.

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178035#comment-17178035
 ] 

Rafael commented on SPARK-26132:


Thank you [~srowen], yes, it works.
{code:java}
val f: Iterator[Row] => Unit = (iterator: Iterator[Row]) => {}
 dataFrame.foreachPartition(f){code}

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178032#comment-17178032
 ] 

Apache Spark commented on SPARK-32609:
--

User 'mingjialiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29435

> Incorrect exchange reuse with DataSourceV2
> --
>
> Key: SPARK-32609
> URL: https://issues.apache.org/jira/browse/SPARK-32609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Mingjia Liu
>Priority: Major
>  Labels: correctness
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
>  
> {code:java}
> spark.conf.set("spark.sql.exchange.reuse","true")
> df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load()
> df.createOrReplaceTempView("date_dim")
> 
> df = spark.sql(""" 
> WITH t1 AS (
> SELECT 
> d_year, d_month_seq
> FROM (
> SELECT t1.d_year , t2.d_month_seq  
> FROM 
> date_dim t1
> cross join
> date_dim t2
> where t1.d_day_name = "Monday" and t1.d_fy_year > 2000
> and t2.d_day_name = "Monday" and t2.d_fy_year > 2000
> )
> GROUP BY d_year, d_month_seq)
>
>  SELECT
> prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq
>  FROM t1 curr_yr cross join t1 prev_yr
>  WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1
>  ORDER BY d_month_seq
>  LIMIT 100
>  
>  """)
> df.explain()
> df.show(){code}
>  
> The repro query:
> A. defines a temp table t1
> B. cross joins t1 filtered to year 2002 with t1 filtered to year 2001
> With exchange reuse enabled, the plan incorrectly "decides" to reuse the 
> persisted shuffle writes of A filtered on year 2002 for year 2001 as well.
> {code:java}
> == Physical Plan ==
> TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS 
> FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L])
> +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS 
> year#24367L, d_month_seq#24371L]
>+- CartesianProduct
>   :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :  +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200)
>   : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :+- BroadcastNestedLoopJoin BuildRight, Cross
>   :   :- *(1) Project [d_year#23551L]
>   :   :  +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] 
> (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), 
> isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   :   +- BroadcastExchange IdentityBroadcastMode
>   :  +- *(2) Project [d_month_seq#24371L]
>   : +- *(2) ScanV2 
> BigQueryDataSourceV2[d_month_seq#24371L] (Filters: 
> [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), 
> isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], 
> functions=[])
>  +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange 
> hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code}
>  
>  
> And the result is obviously incorrect because prev_year should be 2001
> {code:java}
> +-++---+
> |prev_year|year|d_month_seq|
> +-++---+
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> +-++---+
> only showing top 20 rows
> {code}
>  
>  
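As a hedged aside (not part of the original report): until a fix is in place, disabling exchange reuse for the affected query avoids sharing the stale shuffle output, at the cost of recomputing the exchange.
{code:java}
// Possible mitigation sketch (untested): trade exchange reuse for correctness.
// "spark" is an active SparkSession.
spark.conf.set("spark.sql.exchange.reuse", "false")
{code}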



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178031#comment-17178031
 ] 

Apache Spark commented on SPARK-32609:
--

User 'mingjialiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29435

> Incorrect exchange reuse with DataSourceV2
> --
>
> Key: SPARK-32609
> URL: https://issues.apache.org/jira/browse/SPARK-32609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Mingjia Liu
>Priority: Major
>  Labels: correctness
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
>  
> {code:java}
> spark.conf.set("spark.sql.exchange.reuse","true")
> df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load()
> df.createOrReplaceTempView("date_dim")
> 
> df = spark.sql(""" 
> WITH t1 AS (
> SELECT 
> d_year, d_month_seq
> FROM (
> SELECT t1.d_year , t2.d_month_seq  
> FROM 
> date_dim t1
> cross join
> date_dim t2
> where t1.d_day_name = "Monday" and t1.d_fy_year > 2000
> and t2.d_day_name = "Monday" and t2.d_fy_year > 2000
> )
> GROUP BY d_year, d_month_seq)
>
>  SELECT
> prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq
>  FROM t1 curr_yr cross join t1 prev_yr
>  WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1
>  ORDER BY d_month_seq
>  LIMIT 100
>  
>  """)
> df.explain()
> df.show(){code}
>  
> The repro query:
> A. defines a temp table t1
> B. cross joins t1 filtered to year 2002 with t1 filtered to year 2001
> With exchange reuse enabled, the plan incorrectly "decides" to reuse the 
> persisted shuffle writes of A filtered on year 2002 for year 2001 as well.
> {code:java}
> == Physical Plan ==
> TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS 
> FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L])
> +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS 
> year#24367L, d_month_seq#24371L]
>+- CartesianProduct
>   :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :  +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200)
>   : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :+- BroadcastNestedLoopJoin BuildRight, Cross
>   :   :- *(1) Project [d_year#23551L]
>   :   :  +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] 
> (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), 
> isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   :   +- BroadcastExchange IdentityBroadcastMode
>   :  +- *(2) Project [d_month_seq#24371L]
>   : +- *(2) ScanV2 
> BigQueryDataSourceV2[d_month_seq#24371L] (Filters: 
> [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), 
> isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], 
> functions=[])
>  +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange 
> hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code}
>  
>  
> And the result is obviously incorrect because prev_year should be 2001
> {code:java}
> +-++---+
> |prev_year|year|d_month_seq|
> +-++---+
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> +-++---+
> only showing top 20 rows
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2020-08-14 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178024#comment-17178024
 ] 

Sean R. Owen commented on SPARK-26132:
--

I think you need to cast your whole lambda function expression as 
{{(Iterator[T] => Unit)}} or similar.
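For example, a minimal sketch of that disambiguation (the batch size of 100 and the body are placeholders):
{code:java}
import org.apache.spark.sql.{DataFrame, Row}

def process(df: DataFrame): Unit = {
  // Giving the function an explicit Scala type selects the Iterator[Row] => Unit
  // overload of foreachPartition instead of the Java ForeachPartitionFunction one.
  val f: Iterator[Row] => Unit = partition =>
    partition.grouped(100).foreach(batch => println(batch.size))
  df.foreachPartition(f)
}
{code}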

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2020-08-14 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17178019#comment-17178019
 ] 

Rafael commented on SPARK-26132:


[~srowen]

In release notes for Spark 3.0.0 they mentioned your ticket
{quote}Due to the upgrade of Scala 2.12, {{DataStreamWriter.foreachBatch}} is 
not source compatible for Scala program. You need to update your Scala source 
code to disambiguate between Scala function and Java lambda. (SPARK-26132)
{quote}
 

so maybe you know how we should use *foreachPartition* now in Scala code
{code:java}
dataFrame.foreachPartition(partition => {
  partition
    .grouped(Config.BATCH_SIZE)
    .foreach(batch => {
      // process one batch
    })
})
{code}
Right now, calling any method like grouped or foreach on the partition causes the 
exception *value grouped is not a member of Object*.

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32618) ORC writer doesn't support colon in column names

2020-08-14 Thread Pierre Gramme (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Gramme updated SPARK-32618:
--
Description: 
Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 
'struct'}} when exporting to ORC a dataframe whose column names 
contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
the name with colon appears nested as member of a struct.

Seems related to SPARK-21791 (which was solved in 2.3.0).

In my real-life case, the column was actually {{xsi:type}}, coming from some 
parsed xml. Thus other users may be affected too.

Has it been fixed after Spark 2.3.0? (sorry, can't test easily)

Any workaround? Would be acceptable for me to find and replace all colons with 
underscore in column names, but not easy to do in a big set of nested struct 
columns...

Thanks

 

 
{code:java}
 spark.conf.set("spark.sql.orc.impl", "native")

 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct'
 
 import org.apache.spark.sql.functions.struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColon.write.orc("test_colon_struct")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct>'
{code}
 

 

  was:
Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 
'struct'}} when exporting to ORC a dataframe whose column names 
contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
the name with colon appears nested as member of a struct.

Seems related with [#SPARK-21791] (which was solved in 2.3.0).

In my real-life case, the column was actually {{xsi:type}}, coming from some 
parsed xml. Thus other users may be affected too.

Has it been fixed after Spark 2.3.0? (sorry, can't test easily)

Any workaround? Would be acceptable for me to find and replace all colons with 
underscore in column names, but not easy to do in a big set of nested struct 
columns...

Thanks

 

 
{code:java}
 spark.conf.set("spark.sql.orc.impl", "native")

 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct'
 
 import org.apache.spark.sql.functions.struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColon.write.orc("test_colon_struct")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct>'
{code}
 

 


> ORC writer doesn't support colon in column names
> 
>
> Key: SPARK-32618
> URL: https://issues.apache.org/jira/browse/SPARK-32618
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Pierre Gramme
>Priority: Major
>
> Hi,
> I'm getting an {{IllegalArgumentException: Can't parse category at 
> 'struct'}} when exporting to ORC a dataframe whose column names 
> contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
> the name with colon appears nested as member of a struct.
> Seems related with SPARK-21791(which was solved in 2.3.0).
> In my real-life case, the column was actually {{xsi:type}}, coming from some 
> parsed xml. Thus other users may be affected too.
> Has it been fixed after Spark 2.3.0? (sorry, can't test easily)
> Any workaround? Would be acceptable for me to find and replace all colons 
> with underscore in column names, but not easy to do in a big set of nested 
> struct columns...
> Thanks
>  
>  
> {code:java}
>  spark.conf.set("spark.sql.orc.impl", "native")
>  val dfColon = Seq(1).toDF("a:b")
>  dfColon.printSchema()
>  dfColon.show()
>  dfColon.write.orc("test_colon")
>  // Fails with IllegalArgumentException: Can't parse category at 
> 'struct'
>  
>  import org.apache.spark.sql.functions.struct
>  val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
>  dfColonStruct.printSchema()
>  dfColonStruct.show()
>  dfColon.write.orc("test_colon_struct")
>  // Fails with IllegalArgumentException: Can't parse category at 
> 'struct>'
> {code}
>  
>  
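As an aside, one possible (untested) workaround sketch for the renaming question raised here, assuming only the field names need to change so the existing rows can be reused as-is:
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Recursively replace ':' with '_' in every struct field name (arrays and maps included).
def sanitize(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map(f =>
      f.copy(name = f.name.replace(":", "_"), dataType = sanitize(f.dataType))))
  case at: ArrayType => at.copy(elementType = sanitize(at.elementType))
  case mt: MapType   => mt.copy(keyType = sanitize(mt.keyType), valueType = sanitize(mt.valueType))
  case other         => other
}

// Re-apply the sanitized schema; only the names differ, so the row data is unchanged.
def withSanitizedNames(df: DataFrame): DataFrame =
  df.sparkSession.createDataFrame(df.rdd, sanitize(df.schema).asInstanceOf[StructType])

// withSanitizedNames(dfColonStruct).write.orc("test_colon_struct")
{code}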



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32618) ORC writer doesn't support colon in column names

2020-08-14 Thread Pierre Gramme (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Gramme updated SPARK-32618:
--
Description: 
Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 
'struct'}} when exporting to ORC a dataframe whose column names 
contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
the name with colon appears nested as member of a struct.

Seems related with [#SPARK-21791] (which was solved in 2.3.0).

In my real-life case, the column was actually {{xsi:type}}, coming from some 
parsed xml. Thus other users may be affected too.

Has it been fixed after Spark 2.3.0? (sorry, can't test easily)

Any workaround? Would be acceptable for me to find and replace all colons with 
underscore in column names, but not easy to do in a big set of nested struct 
columns...

Thanks

 

 
{code:java}
 spark.conf.set("spark.sql.orc.impl", "native")

 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct'
 
 import org.apache.spark.sql.functions.struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColon.write.orc("test_colon_struct")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct>'
{code}
 

 

  was:
Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 
'struct'}} when exporting to ORC a dataframe whose column names 
contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
the name with colon appears nested as member of a struct.

In my real-life case, the column was actually {{xsi:type}}, coming from some 
parsed xml. Thus other users may be affected too.

Has it been fixed after Spark 2.3.0? (sorry, can't test easily)

Any workaround? Would be acceptable for me to find and replace all colons with 
underscore in column names, but not easy to do in a big set of nested struct 
columns...

Thanks

 

 
{code:java}
 spark.conf.set("spark.sql.orc.impl", "native")

 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct'
 
 import org.apache.spark.sql.functions.struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColon.write.orc("test_colon_struct")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct>'
{code}
 

 


> ORC writer doesn't support colon in column names
> 
>
> Key: SPARK-32618
> URL: https://issues.apache.org/jira/browse/SPARK-32618
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Pierre Gramme
>Priority: Major
>
> Hi,
> I'm getting an {{IllegalArgumentException: Can't parse category at 
> 'struct'}} when exporting to ORC a dataframe whose column names 
> contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
> the name with colon appears nested as member of a struct.
> Seems related with [#SPARK-21791] (which was solved in 2.3.0).
> In my real-life case, the column was actually {{xsi:type}}, coming from some 
> parsed xml. Thus other users may be affected too.
> Has it been fixed after Spark 2.3.0? (sorry, can't test easily)
> Any workaround? Would be acceptable for me to find and replace all colons 
> with underscore in column names, but not easy to do in a big set of nested 
> struct columns...
> Thanks
>  
>  
> {code:java}
>  spark.conf.set("spark.sql.orc.impl", "native")
>  val dfColon = Seq(1).toDF("a:b")
>  dfColon.printSchema()
>  dfColon.show()
>  dfColon.write.orc("test_colon")
>  // Fails with IllegalArgumentException: Can't parse category at 
> 'struct'
>  
>  import org.apache.spark.sql.functions.struct
>  val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
>  dfColonStruct.printSchema()
>  dfColonStruct.show()
>  dfColon.write.orc("test_colon_struct")
>  // Fails with IllegalArgumentException: Can't parse category at 
> 'struct>'
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32618) ORC writer doesn't support colon in column names

2020-08-14 Thread Pierre Gramme (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Gramme updated SPARK-32618:
--
Description: 
Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 
'struct'}} when exporting to ORC a dataframe whose column names 
contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
the name with colon appears nested as member of a struct.

In my real-life case, the column was actually {{xsi:type}}, coming from some 
parsed xml. Thus other users may be affected too.

Has it been fixed after Spark 2.3.0? (sorry, can't test easily)

Any workaround? Would be acceptable for me to find and replace all colons with 
underscore in column names, but not easy to do in a big set of nested struct 
columns...

Thanks

 

 
{code:java}
 spark.conf.set("spark.sql.orc.impl", "native")

 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct'
 
 import org.apache.spark.sql.functions.struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColon.write.orc("test_colon_struct")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct>'
{code}
 

 

  was:
Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 
'struct'}} when exporting to ORC a dataframe whose column names 
contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
the name with colon appears nested as member of a struct.

In my real-life case, the column was actually {{xsi:type}}, coming from some 
parsed xml. Thus other users may be affected too.

Has it been fixed after Spark 2.3.0? (sorry, can't test easily)

Any workaround? Would be acceptable for me to find and replace all colons with 
underscore in column names, but not easy to do in a big set of nested struct 
columns...

Thanks

 

 
{code:java}
 spark.conf.set("spark.sql.orc.impl", "native")

 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct'
 
 import org.apache.spark.sql.functions.struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColon.write.orc("test_colon_struct")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct'
{code}
 

 


> ORC writer doesn't support colon in column names
> 
>
> Key: SPARK-32618
> URL: https://issues.apache.org/jira/browse/SPARK-32618
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
>Reporter: Pierre Gramme
>Priority: Major
>
> Hi,
> I'm getting an {{IllegalArgumentException: Can't parse category at 
> 'struct'}} when exporting to ORC a dataframe whose column names 
> contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
> the name with colon appears nested as member of a struct.
> In my real-life case, the column was actually {{xsi:type}}, coming from some 
> parsed xml. Thus other users may be affected too.
> Has it been fixed after Spark 2.3.0? (sorry, can't test easily)
> Any workaround? Would be acceptable for me to find and replace all colons 
> with underscore in column names, but not easy to do in a big set of nested 
> struct columns...
> Thanks
>  
>  
> {code:java}
>  spark.conf.set("spark.sql.orc.impl", "native")
>  val dfColon = Seq(1).toDF("a:b")
>  dfColon.printSchema()
>  dfColon.show()
>  dfColon.write.orc("test_colon")
>  // Fails with IllegalArgumentException: Can't parse category at 
> 'struct'
>  
>  import org.apache.spark.sql.functions.struct
>  val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
>  dfColonStruct.printSchema()
>  dfColonStruct.show()
>  dfColon.write.orc("test_colon_struct")
>  // Fails with IllegalArgumentException: Can't parse category at 
> 'struct>'
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32619) converting dataframe to dataset for the json schema

2020-08-14 Thread Manjay Kumar (Jira)
Manjay Kumar created SPARK-32619:


 Summary: converting dataframe to dataset for the json schema
 Key: SPARK-32619
 URL: https://issues.apache.org/jira/browse/SPARK-32619
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Manjay Kumar


I have a JSON schema like this (the street field is missing from the data):

{
  "Details": [{
    "phone": "98977999",
    "contacts": [{
      "name": "manjay"
    }]
  }]
}

Case classes, based on the schema:

case class Details(
  phone: String,
  contacts: Array[Adress]
)

case class Adress(
  name: String,
  street: String
)

This throws: No such struct field street - Analysis exception.

dataframe.as[Details]

Is this a bug? Or is there a resolution for this?
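
One possible (untested) resolution sketch, rather than a confirmed fix: read the JSON with a schema derived from the case classes, so that fields missing from the data (like street) come back as null instead of failing the as[] conversion. The Root wrapper class and the input path below are assumptions for illustration only.
{code:java}
import org.apache.spark.sql.{Encoders, SparkSession}

case class Adress(name: String, street: String)
case class Details(phone: String, contacts: Array[Adress])
case class Root(Details: Array[Details])   // hypothetical wrapper for the top-level "Details" array

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Derive the expected schema from the case classes and apply it at read time.
val expectedSchema = Encoders.product[Root].schema
val ds = spark.read.schema(expectedSchema).json("/path/to/input.json").as[Root]
{code}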

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32618) ORC writer doesn't support colon in column names

2020-08-14 Thread Pierre Gramme (Jira)
Pierre Gramme created SPARK-32618:
-

 Summary: ORC writer doesn't support colon in column names
 Key: SPARK-32618
 URL: https://issues.apache.org/jira/browse/SPARK-32618
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.0
Reporter: Pierre Gramme


Hi,

I'm getting an {{IllegalArgumentException: Can't parse category at 
'struct'}} when exporting to ORC a dataframe whose column names 
contain colon ({{:}}). Reproducible as hereunder. Same problem also occurs if 
the name with colon appears nested as member of a struct.

In my real-life case, the column was actually {{xsi:type}}, coming from some 
parsed xml. Thus other users may be affected too.

Has it been fixed after Spark 2.3.0? (sorry, can't test easily)

Any workaround? Would be acceptable for me to find and replace all colons with 
underscore in column names, but not easy to do in a big set of nested struct 
columns...

Thanks

 

 
{code:java}
 spark.conf.set("spark.sql.orc.impl", "native")

 val dfColon = Seq(1).toDF("a:b")
 dfColon.printSchema()
 dfColon.show()
 dfColon.write.orc("test_colon")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct'
 
 import org.apache.spark.sql.functions.struct
 val dfColonStruct = dfColon.withColumn("x", struct($"a:b")).drop("a:b")
 dfColonStruct.printSchema()
 dfColonStruct.show()
 dfColon.write.orc("test_colon_struct")
 // Fails with IllegalArgumentException: Can't parse category at 
'struct'
{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177820#comment-17177820
 ] 

Apache Spark commented on SPARK-32526:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29434

> Let sql/catalyst module tests pass for Scala 2.13
> -
>
> Key: SPARK-32526
> URL: https://issues.apache.org/jira/browse/SPARK-32526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-and-aborted-20200806
>
>
> sql/catalyst module has following compile errors with scala-2.13 profile:
> {code:java}
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
>  required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
> {code}
> Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq 
> on these to ensure they still work on 2.12.
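
For illustration, a minimal standalone example of the pattern this refers to (not the actual Analyzer/Optimizer code):
{code:java}
import scala.collection.mutable.ArrayBuffer

// On Scala 2.13, scala.Seq is an alias for immutable.Seq, so a mutable
// ArrayBuffer no longer satisfies a Seq[...] parameter by itself.
def sumAll(xs: Seq[Int]): Int = xs.sum

val buf = ArrayBuffer(1, 2, 3)
sumAll(buf.toSeq)   // compiles on both 2.12 and 2.13; sumAll(buf) fails on 2.13
{code}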



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32615:

Description: 
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker java.util.NoSuchElementException: key not found: 
12 
at scala.collection.immutable.Map$Map1.apply(Map.scala:114) 
at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
milliseconds)
{code}
This issue arises because, during AQE, the metrics update is overwritten when a 
sub-plan changes. For example, in this UT the plan changes from a 
BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd action 
then tries to aggregate all metrics collected during the execution, including 
the old ones. That causes the NoSuchElementException, since the metric-type map 
has already been updated for the rewritten plan. So we need to filter out those 
outdated metrics.
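
A hedged sketch of the kind of guard meant here; the names (metricTypes, taskMetricValues) are illustrative and not the actual SQLAppStatusListener fields:
{code:java}
// Illustrative only: keep just the accumulator updates whose ids are still
// registered as metrics of the final (re-planned) physical plan.
def aggregateSafely(
    metricTypes: Map[Long, String],
    taskMetricValues: Seq[(Long, Long)]): Map[Long, Seq[Long]] = {
  taskMetricValues
    .filter { case (accumId, _) => metricTypes.contains(accumId) } // drop outdated metrics
    .groupBy { case (accumId, _) => accumId }
    .map { case (accumId, values) => accumId -> values.map(_._2) }
}
{code}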

  was:
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker java.util.NoSuchElementException: key not found: 
12 
at scala.collection.immutable.Map$Map1.apply(Map.scala:114) 
at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 

[jira] [Updated] (SPARK-32616) Window operators should be added determinedly

2020-08-14 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32616:
-
Issue Type: Improvement  (was: Bug)

> Window operators should be added determinedly
> -
>
> Key: SPARK-32616
> URL: https://issues.apache.org/jira/browse/SPARK-32616
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, the addWindow() method doesn't add window operators deterministically. 
> The same query can result in different plans (different window order) 
> because of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32616) Window operators should be added determinedly

2020-08-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32616:
---

Assignee: wuyi

> Window operators should be added determinedly
> -
>
> Key: SPARK-32616
> URL: https://issues.apache.org/jira/browse/SPARK-32616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Currently, the addWindow() method doesn't add window operators deterministically. 
> The same query can result in different plans (different window order) 
> because of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32616) Window operators should be added determinedly

2020-08-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32616.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29432
[https://github.com/apache/spark/pull/29432]

> Window operators should be added determinedly
> -
>
> Key: SPARK-32616
> URL: https://issues.apache.org/jira/browse/SPARK-32616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, the addWindow() method doesn't add window operators deterministically. 
> The same query can result in different plans (different window order) 
> because of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package

2020-08-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177765#comment-17177765
 ] 

Fabian Höring edited comment on SPARK-32187 at 8/14/20, 1:24 PM:
-

About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those 
settings:
{code:java}
 spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py

{code}
I don't know whether they still work, but personally I would close the ticket and not 
put this in the doc. I think it is not the right way to do it, as it doesn't 
scale to hundreds of executors and can produce race conditions for the tasks running 
on the same executor (multiple pip installs at the same time on the same node).

 


was (Author: fhoering):
About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those 
settings:

{code}
 spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py

{code}


 I don't know they still work but personally I would close the ticket and not 
put this in the doc. I think it is not the right way to to it as it doens't 
scale to 100 executors and can produce race conditions for the task running on 
the same executor (multiple pip installs at the same time on the same node)

 

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> - Zipped file
> - Python files
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package

2020-08-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177765#comment-17177765
 ] 

Fabian Höring edited comment on SPARK-32187 at 8/14/20, 1:23 PM:
-

About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those 
settings:

{code}
 spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py

{code}


 I don't know they still work but personally I would close the ticket and not 
put this in the doc. I think it is not the right way to to it as it doens't 
scale to 100 executors and can produce race conditions for the task running on 
the same executor (multiple pip installs at the same time on the same node)

 


was (Author: fhoering):
About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those 
settings:
spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py
I don't know they still work but personally I would close the ticket and not 
put this in the doc. I think it is not the right way to to it as it doens't 
scale to 100 executors and can produce race conditions for the task running on 
the same executor (multiple pip installs at the same time on the same node)

 

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> - Zipped file
> - Python files
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-08-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177765#comment-17177765
 ] 

Fabian Höring commented on SPARK-32187:
---

About this ticket: https://issues.apache.org/jira/browse/SPARK-13587 and those 
settings:
spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py
I don't know they still work but personally I would close the ticket and not 
put this in the doc. I think it is not the right way to to it as it doens't 
scale to 100 executors and can produce race conditions for the task running on 
the same executor (multiple pip installs at the same time on the same node)

 

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> - Zipped file
> - Python files
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package

2020-08-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177760#comment-17177760
 ] 

Fabian Höring edited comment on SPARK-32187 at 8/14/20, 1:10 PM:
-

[~hyukjin.kwon]
 I started working on it. The new doc looks pretty nice ! Thanks for the effort 
on this. 
 I think I can also write about py-files and zipped envs.

Here is a first (in progress) draft. I will make it consistent across the 
examples. All links target the current doc.
 
[https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe]
 I will be in holidays for 2 weeks. So no progress will be done. It would be 
nice if you have time to have a look and give some feedback on the comments 
below.

Some considerations:

It is structured around the vectorized udf example:
 - Using PEX
 - Using a zipped virtual environment
 - Using py files
 - What about the Spark jars ?

I reference these external tools; I don't have any affiliation with them:
 - [https://github.com/pantsbuild/pex]
 - [https://conda.github.io/conda-pack/spark.html] => seems the only 
alternative for conda for now afaik
 - [https://jcristharif.com/venv-pack/spark.html] => it handles venv zip, 
personally I would recommend to use pex because it is self contained but for 
completeness I added it

I also referenced my docker spark standalone e2e example => I don't really want 
to promote my own stuff here but I think it could probably be helpful for 
people to have something running directly, the examples always strip some code, 
if you think it should not be there we can remove it. I don't mind also moving 
it to the spark repo.

Some stuff I'm not sure about:
{quote}The unzip will be done by Spark when using target ``--archives`` option 
in spark-submit 
 or setting ``spark.yarn.dist.archives`` configuration.
{quote}
It seems like there is no way to set the archives as a config param when not 
running on YARN. I checked the doc and the Spark code, so it seems 
inconsistent. Can you check or confirm?
{quote}It doesn't allow to add packages built as `Wheels 
<[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allowing 
to include dependencies with native code.
{quote}
I think it is the case but we need to check to be sure that it doesn't say 
something wrong. I can try by adding some wheel and see if it works.

There is maybe one sentence to say about docker also. Basically what is 
described here is the lightweight Python way to do it.


was (Author: fhoering):
[~hyukjin.kwon]
 I started working on it. The new doc looks pretty nice ! Thanks for the effort 
on this. 
 I think I can also write about py-files and zipped envs.

Here is a first (in progress) draft. I will make it consistent across the 
examples. All links target the current doc.
 
[https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe]
 I will be in holidays for 2 weeks. So no progress will be done. It would be 
nice if you have time to have a look and give some feedback on the comments 
below.

Some considerations:

It is structured around the vectorized udf example:
 - Using PEX
 - Using a zipped virtual environment
 - Using py files
 - What about the Spark jars ?

I references those external tools. I don't have any affiliation to those tools:
 - [https://github.com/pantsbuild/pex]
 - [https://conda.github.io/conda-pack/spark.html] => seems the only 
alternative for conda for now afaik
 - [https://jcristharif.com/venv-pack/spark.html] => it handles venv zip, 
personally I would recommend to use pex because it is self contained but for 
completeness I added it

I also referenced my docker spark standalone e2e example => I don't really want 
to promote my own stuff here but I think it could probably be helpful for 
people to have something running directly, the examples always strip some code, 
if you think it should not be there we can remove it. I don't mind also moving 
it to the spark repo.

Some stuff I'm not sure about:
{quote}The unzip will be done by Spark when using target ``--archives`` option 
in spark-submit 
 or setting ``spark.yarn.dist.archives`` configuration.
{quote}
I seems like there is no way to set the archives as a config param when not 
running on YARN. I checked the doc the the spark code. So it seems 
inconsistent. Can you check or confirm ?
{quote}It doesn't allow to add packages built as `Wheels 
<[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allowing 
to include dependencies with native code.
{quote}
I think it is the case but we need to check to be sure that it doesn't say 
something wrong. I can try by adding some wheel and see if it works.

There is maybe one sentence to say about docker also. Basically what is 
described here is the lightweight Python way to do it.

> User Guide - Shipping Python Package
> 
>
>   

[jira] [Comment Edited] (SPARK-32187) User Guide - Shipping Python Package

2020-08-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177760#comment-17177760
 ] 

Fabian Höring edited comment on SPARK-32187 at 8/14/20, 1:08 PM:
-

[~hyukjin.kwon]
 I started working on it. The new doc looks pretty nice! Thanks for the effort 
on this.
 I think I can also write about py-files and zipped envs.

Here is a first (work in progress) draft. I will make it consistent across the 
examples. All links target the current doc.
 
[https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe]
 I will be on holiday for 2 weeks, so no progress will be made during that 
time. It would be nice if you have time to have a look and give some feedback 
on the comments below.

Some considerations:

It is structured around the vectorized UDF example:
 - Using PEX
 - Using a zipped virtual environment
 - Using py files
 - What about the Spark jars?

I reference these external tools; I don't have any affiliation with them:
 - [https://github.com/pantsbuild/pex] => a PEX-based sketch is shown just 
after this list
 - [https://conda.github.io/conda-pack/spark.html] => seems to be the only 
alternative for conda for now, as far as I know
 - [https://jcristharif.com/venv-pack/spark.html] => it handles zipped venvs; 
personally I would recommend PEX because it is self-contained, but I added it 
for completeness
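
To make it concrete, the PEX section would show something roughly like this 
(an untested sketch; the package list and file names are placeholders):
{code:bash}
# Build a self-contained PEX file with the job's Python dependencies
pip install pex
pex pyspark pyarrow pandas -o pyspark_pex_env.pex

# Ship the PEX file and use it as the Python interpreter on the executors
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./pyspark_pex_env.pex
spark-submit --files pyspark_pex_env.pex app.py
{code}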

I also referenced my Docker Spark standalone end-to-end example => I don't 
really want to promote my own stuff here, but I think it could be helpful for 
people to have something running directly, since the examples always strip 
some code. If you think it should not be there, we can remove it. I also don't 
mind moving it to the Spark repo.

Some stuff I'm not sure about:
{quote}The unzip will be done by Spark when using the ``--archives`` option 
in spark-submit 
 or setting the ``spark.yarn.dist.archives`` configuration.
{quote}
It seems like there is no way to set the archives as a config param when not 
running on YARN. I checked the doc and the Spark code, so it seems 
inconsistent. Can you check or confirm?
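
For reference, the conda-pack / ``--archives`` flow I have in mind looks 
roughly like this (an untested sketch; environment and file names are 
placeholders):
{code:bash}
# Pack an existing conda environment into a relocatable archive
conda create -y -n pyspark_conda_env pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz

# On YARN, Spark unpacks the archive under the alias given after '#'
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --master yarn --deploy-mode client \
  --archives pyspark_conda_env.tar.gz#environment app.py
{code}
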
{quote}It doesn't allow adding packages built as `Wheels 
<[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allow 
including dependencies with native code.
{quote}
I think this is the case, but we need to check to be sure that the doc doesn't 
say something wrong. I can try adding a wheel and see if it works.
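
A rough way to test it, assuming that sentence refers to the py-files 
mechanism (untested; file and script names are placeholders):
{code:bash}
# Fetch a wheel that contains native code and try to ship it via --py-files
pip download numpy --only-binary=:all: -d wheelhouse/
spark-submit --py-files wheelhouse/numpy-*.whl check_numpy.py
# check_numpy.py simply imports numpy inside a UDF;
# if the import fails on the executors, the doc sentence is confirmed.
{code}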

There should maybe also be one sentence about Docker. Basically, what is 
described here is the lightweight Python way to do it.


was (Author: fhoering):
[~hyukjin.kwon]
 I started working on it. The new doc looks pretty nice! Thanks for the effort 
on this.
 I think I can also write about py-files and zipped envs.

Here is a first (work in progress) draft. I will make it consistent across the 
examples. All links target the current doc.
 
[https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe]
 I will be on holiday for 2 weeks, so no progress will be made during that 
time. It would be nice if you have time to have a look and give some feedback 
on the comments below.

Some considerations:

It is structured around the vectorized UDF example:
 - Using PEX
 - Using a zipped virtual environment
 - Using py files
 - What about the Spark jars?

I reference these external tools; I don't have any affiliation with them:
 - [https://github.com/pantsbuild/pex]
 - [https://conda.github.io/conda-pack/spark.html] => seems to be the only 
alternative for conda for now, as far as I know
 - [https://jcristharif.com/venv-pack/spark.html] => it handles zipped venvs; 
personally I would recommend PEX because it is self-contained, but I added it 
for completeness

I also referenced my Docker Spark standalone end-to-end example => I don't 
really want to promote my own stuff here, but I think it could be helpful for 
people to have something running directly, since the examples always strip 
some code. If you think it should not be there, we can remove it. I also don't 
mind moving it to the Spark repo.

Some stuff I'm not sure about:
{quote}The unzip will be done by Spark when using the ``--archives`` option 
in spark-submit 
 or setting the ``spark.yarn.dist.archives`` configuration.
{quote}
It seems like there is no way to set the archives as a config param when not 
running on YARN. I checked the doc and the Spark code, so it seems 
inconsistent. Can you check or confirm?
{quote}It doesn't allow adding packages built as `Wheels 
<[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allow 
including dependencies with native code.
{quote}
I think this is the case, but we need to check to be sure that the doc doesn't 
say something wrong. I can try adding a wheel and see if it works.

There should maybe also be one sentence about Docker. Basically, what is 
described here is the lightweight Python way to do it.

> User Guide - Shipping Python Package
> 
>
> 

[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package

2020-08-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177760#comment-17177760
 ] 

Fabian Höring commented on SPARK-32187:
---

[~hyukjin.kwon]
 I started working on it. The new doc looks pretty nice! Thanks for the effort 
on this.
 I think I can also write about py-files and zipped envs.

Here is a first (work in progress) draft. I will make it consistent across the 
examples. All links target the current doc.
 
[https://github.com/fhoering/spark/commit/843b1caa27594bc4bc3cb9637da6f8695db66fbe]
 I will be on holiday for 2 weeks, so no progress will be made during that 
time. It would be nice if you have time to have a look and give some feedback 
on the comments below.

Some considerations:

It is structured around the vectorized UDF example:
 - Using PEX
 - Using a zipped virtual environment
 - Using py files
 - What about the Spark jars?

I reference these external tools; I don't have any affiliation with them:
 - [https://github.com/pantsbuild/pex]
 - [https://conda.github.io/conda-pack/spark.html] => seems to be the only 
alternative for conda for now, as far as I know
 - [https://jcristharif.com/venv-pack/spark.html] => it handles zipped venvs; 
personally I would recommend PEX because it is self-contained, but I added it 
for completeness

I also referenced my Docker Spark standalone end-to-end example => I don't 
really want to promote my own stuff here, but I think it could be helpful for 
people to have something running directly, since the examples always strip 
some code. If you think it should not be there, we can remove it. I also don't 
mind moving it to the Spark repo.

Some stuff I'm not sure about:
{quote}The unzip will be done by Spark when using the ``--archives`` option 
in spark-submit 
 or setting the ``spark.yarn.dist.archives`` configuration.
{quote}
It seems like there is no way to set the archives as a config param when not 
running on YARN. I checked the doc and the Spark code, so it seems 
inconsistent. Can you check or confirm?
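
For the zipped virtualenv case, the flow would look roughly like this (an 
untested sketch; names are placeholders):
{code:bash}
# Pack a virtualenv with venv-pack and ship it as an archive
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pyarrow pandas venv-pack
venv-pack -o pyspark_venv.tar.gz

# Spark unpacks the archive under the alias given after '#'
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --master yarn --archives pyspark_venv.tar.gz#environment app.py
{code}
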
{quote}It doesn't allow adding packages built as `Wheels 
<[https://www.python.org/dev/peps/pep-0427/]>`_ and therefore doesn't allow 
including dependencies with native code.
{quote}
I think this is the case, but we need to check to be sure that the doc doesn't 
say something wrong. I can try adding a wheel and see if it works.

There should maybe also be one sentence about Docker. Basically, what is 
described here is the lightweight Python way to do it.

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> - Zipped file
> - Python files
> - PEX \(?\) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177749#comment-17177749
 ] 

Apache Spark commented on SPARK-32526:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29433

> Let sql/catalyst module tests pass for Scala 2.13
> -
>
> Key: SPARK-32526
> URL: https://issues.apache.org/jira/browse/SPARK-32526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-and-aborted-20200806
>
>
> The sql/catalyst module has the following compile errors with the scala-2.13 profile:
> {code:java}
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
>  required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
> {code}
> Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq 
> on these to ensure they still work on 2.12.
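
A minimal illustration of the kind of change described above (a hypothetical 
method, not taken from the actual patch):
{code:java}
import scala.collection.mutable.ArrayBuffer

// In Scala 2.13, scala.Seq is an alias for immutable.Seq, so a mutable
// ArrayBuffer is no longer accepted where a Seq is expected. Calling
// .toSeq at the end compiles and behaves the same on 2.12 and 2.13.
def collectPairs(n: Int): Seq[(Int, Int)] = {
  val buf = ArrayBuffer.empty[(Int, Int)]
  (0 until n).foreach(i => buf += ((i, i)))
  buf.toSeq
}
{code}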



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177748#comment-17177748
 ] 

Apache Spark commented on SPARK-32526:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29433

> Let sql/catalyst module tests pass for Scala 2.13
> -
>
> Key: SPARK-32526
> URL: https://issues.apache.org/jira/browse/SPARK-32526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: failed-and-aborted-20200806
>
>
> The sql/catalyst module has the following compile errors with the scala-2.13 profile:
> {code:java}
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : 
> scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute,
>  org.apache.spark.sql.catalyst.expressions.Attribute)] <: 
> Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, 
> org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] 
> /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952:
>  type mismatch;
>  found   : 
> scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
>  required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
> {code}
> Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq 
> on these to ensure they still work on 2.12.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables

2020-08-14 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177708#comment-17177708
 ] 

Ramakrishna Prasad K S edited comment on SPARK-32234 at 8/14/20, 11:22 AM:
---

Thanks [~saurabhc100]. I am going ahead and merging these changes into our 
product, which is on Spark_3.0. I hope there are no regressions or side effects 
due to these changes. I just wanted to know why this bug is still in the 
resolved state. Is any test still pending to be run? Thank you.


was (Author: ramks):
Thanks [~saurabhc100]. I am going ahead and merging these changes into my local 
Spark_3.0 setup. I hope there are no regressions or side effects due to these 
changes. I just wanted to know why this bug is still in the resolved state. Is 
any test still pending to be run? Thank you.

> Spark sql commands are failing on select Queries for the  orc tables
> 
>
> Key: SPARK-32234
> URL: https://issues.apache.org/jira/browse/SPARK-32234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Assignee: Saurabh Chawla
>Priority: Blocker
> Fix For: 3.0.1, 3.1.0
>
> Attachments: e17f6887c06d47f6a62c0140c1ad569c_00
>
>
> Spark sql commands are failing on select Queries for the orc tables
> Steps to reproduce
>  
> {code:java}
> val table = """CREATE TABLE `date_dim` (
>   `d_date_sk` INT,
>   `d_date_id` STRING,
>   `d_date` TIMESTAMP,
>   `d_month_seq` INT,
>   `d_week_seq` INT,
>   `d_quarter_seq` INT,
>   `d_year` INT,
>   `d_dow` INT,
>   `d_moy` INT,
>   `d_dom` INT,
>   `d_qoy` INT,
>   `d_fy_year` INT,
>   `d_fy_quarter_seq` INT,
>   `d_fy_week_seq` INT,
>   `d_day_name` STRING,
>   `d_quarter_name` STRING,
>   `d_holiday` STRING,
>   `d_weekend` STRING,
>   `d_following_holiday` STRING,
>   `d_first_dom` INT,
>   `d_last_dom` INT,
>   `d_same_day_ly` INT,
>   `d_same_day_lq` INT,
>   `d_current_day` STRING,
>   `d_current_week` STRING,
>   `d_current_month` STRING,
>   `d_current_quarter` STRING,
>   `d_current_year` STRING)
> USING orc
> LOCATION '/Users/test/tpcds_scale5data/date_dim'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1574682806')"""
> spark.sql(table).collect
> val u = """select date_dim.d_date_id from date_dim limit 5"""
> spark.sql(u).collect
> {code}
>  
>  
> Exception
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, 192.168.0.103, executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:133)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
> at 
> 

[jira] [Commented] (SPARK-32234) Spark sql commands are failing on select Queries for the orc tables

2020-08-14 Thread Ramakrishna Prasad K S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177708#comment-17177708
 ] 

Ramakrishna Prasad K S commented on SPARK-32234:


Thanks [~saurabhc100]. I am going ahead and merging these changes into my local 
Spark_3.0 setup. I hope there are no regressions or side effects due to these 
changes. I just wanted to know why this bug is still in the resolved state. Is 
any test still pending to be run? Thank you.

> Spark sql commands are failing on select Queries for the  orc tables
> 
>
> Key: SPARK-32234
> URL: https://issues.apache.org/jira/browse/SPARK-32234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Saurabh Chawla
>Assignee: Saurabh Chawla
>Priority: Blocker
> Fix For: 3.0.1, 3.1.0
>
> Attachments: e17f6887c06d47f6a62c0140c1ad569c_00
>
>
> Spark sql commands are failing on select Queries for the orc tables
> Steps to reproduce
>  
> {code:java}
> val table = """CREATE TABLE `date_dim` (
>   `d_date_sk` INT,
>   `d_date_id` STRING,
>   `d_date` TIMESTAMP,
>   `d_month_seq` INT,
>   `d_week_seq` INT,
>   `d_quarter_seq` INT,
>   `d_year` INT,
>   `d_dow` INT,
>   `d_moy` INT,
>   `d_dom` INT,
>   `d_qoy` INT,
>   `d_fy_year` INT,
>   `d_fy_quarter_seq` INT,
>   `d_fy_week_seq` INT,
>   `d_day_name` STRING,
>   `d_quarter_name` STRING,
>   `d_holiday` STRING,
>   `d_weekend` STRING,
>   `d_following_holiday` STRING,
>   `d_first_dom` INT,
>   `d_last_dom` INT,
>   `d_same_day_ly` INT,
>   `d_same_day_lq` INT,
>   `d_current_day` STRING,
>   `d_current_week` STRING,
>   `d_current_month` STRING,
>   `d_current_quarter` STRING,
>   `d_current_year` STRING)
> USING orc
> LOCATION '/Users/test/tpcds_scale5data/date_dim'
> TBLPROPERTIES (
>   'transient_lastDdlTime' = '1574682806')"""
> spark.sql(table).collect
> val u = """select date_dim.d_date_id from date_dim limit 5"""
> spark.sql(u).collect
> {code}
>  
>  
> Exception
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 2, 192.168.0.103, executor driver): 
> java.lang.ArrayIndexOutOfBoundsException: 1
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:133)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  
> The reason behind this is that initBatch is not getting the schema that is 
> needed to find the column value in OrcFileFormat.scala.
> 

[jira] [Created] (SPARK-32617) Upgrade kubernetes client version to support latest minikube version.

2020-08-14 Thread Prashant Sharma (Jira)
Prashant Sharma created SPARK-32617:
---

 Summary: Upgrade kubernetes client version to support latest 
minikube version.
 Key: SPARK-32617
 URL: https://issues.apache.org/jira/browse/SPARK-32617
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.1.0
Reporter: Prashant Sharma


The following error occurs when the k8s integration tests are run against a 
minikube cluster with version 1.2.1:

{code:java}
Run starting. Expected test count is: 18
KubernetesSuite:
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED ***
  io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
  at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
  at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
  at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:196)
  at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:62)
  at io.fabric8.kubernetes.client.BaseClient.(BaseClient.java:51)
  at 
io.fabric8.kubernetes.client.DefaultKubernetesClient.(DefaultKubernetesClient.java:105)
  at 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:81)
  at 
org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
  at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:131)
  at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
  ...
  Cause: java.nio.file.NoSuchFileException: /root/.minikube/apiserver.crt
  at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
  at 
sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
  at java.nio.file.Files.newByteChannel(Files.java:361)
  at java.nio.file.Files.newByteChannel(Files.java:407)
  at java.nio.file.Files.readAllBytes(Files.java:3152)
  at 
io.fabric8.kubernetes.client.internal.CertUtils.getInputStreamFromDataOrFile(CertUtils.java:72)
  at 
io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:242)
  at 
io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
  ...
Run completed in 1 second, 821 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 1
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
*** 1 SUITE ABORTED ***
[INFO] 
[INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  4.454 s]
[INFO] Spark Project Tags . SUCCESS [  4.768 s]
[INFO] Spark Project Local DB . SUCCESS [  2.961 s]
[INFO] Spark Project Networking ... SUCCESS [  4.258 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  5.703 s]
[INFO] Spark Project Unsafe ... SUCCESS [  3.239 s]
[INFO] Spark Project Launcher . SUCCESS [  3.224 s]
[INFO] Spark Project Core . SUCCESS [02:25 min]
[INFO] Spark Project Kubernetes Integration Tests . FAILURE [ 17.244 s]
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time:  03:12 min
[INFO] Finished at: 2020-08-11T06:26:15-05:00
[INFO] 
[ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.0:test 
(integration-test) on project spark-kubernetes-integration-tests_2.12: There 
are test failures -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]  {code}

Newer minikube releases have support for profiles, which is enabled simply by 
upgrading the minikube version.
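
For instance, a recent minikube can start a dedicated cluster for the 
integration tests under a named profile (an illustrative command only; the 
profile name and resource sizes are arbitrary):
{code:bash}
# Start a dedicated minikube profile for the Spark on K8s integration tests
minikube start -p spark-k8s-it --cpus 4 --memory 6144
minikube profile list
{code}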



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32616) Window operators should be added determinedly

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177663#comment-17177663
 ] 

Apache Spark commented on SPARK-32616:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29432

> Window operators should be added determinedly
> -
>
> Key: SPARK-32616
> URL: https://issues.apache.org/jira/browse/SPARK-32616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, the addWindow() method doesn't add window operators 
> deterministically. Because of this, the same query can result in different 
> plans (with a different window order).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32616) Window operators should be added determinedly

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32616:


Assignee: Apache Spark

> Window operators should be added determinedly
> -
>
> Key: SPARK-32616
> URL: https://issues.apache.org/jira/browse/SPARK-32616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the addWindow() method doesn't add window operators 
> deterministically. Because of this, the same query can result in different 
> plans (with a different window order).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32616) Window operators should be added determinedly

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32616:


Assignee: (was: Apache Spark)

> Window operators should be added determinedly
> -
>
> Key: SPARK-32616
> URL: https://issues.apache.org/jira/browse/SPARK-32616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, the addWindow() method doesn't add window operators 
> deterministically. Because of this, the same query can result in different 
> plans (with a different window order).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32616) Window operators should be added determinedly

2020-08-14 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-32616:
-
Issue Type: Bug  (was: Improvement)

> Window operators should be added determinedly
> -
>
> Key: SPARK-32616
> URL: https://issues.apache.org/jira/browse/SPARK-32616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, the addWindow() method doesn't add window operators 
> deterministically. Because of this, the same query can result in different 
> plans (with a different window order).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32616) Window operators should be added determinedly

2020-08-14 Thread wuyi (Jira)
wuyi created SPARK-32616:


 Summary: Window operators should be added determinedly
 Key: SPARK-32616
 URL: https://issues.apache.org/jira/browse/SPARK-32616
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: wuyi


Currently, the addWindow() method doesn't add window operators 
deterministically. Because of this, the same query can result in different 
plans (with a different window order).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-14 Thread Dustin Smith (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177584#comment-17177584
 ] 

Dustin Smith commented on SPARK-32046:
--

[~maropu] Yes, one would expect current_timestamp to be evaluated per 
dataframe for each call, but that isn't what happens; my Jira ticket is about 
the fact that all dataframes get the same timestamp. Once current_timestamp 
has been called once, that is the time from then on. That is, even with a new 
dataframe, a new query execution and a new query plan, the time will be the 
one from the first call in the shell. On Jupyter and ZP, it will increment 
twice before freezing.

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32590) Remove fullOutput from RowDataSourceScanExec

2020-08-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32590:
---

Assignee: Huaxin Gao

> Remove fullOutput from RowDataSourceScanExec
> 
>
> Key: SPARK-32590
> URL: https://issues.apache.org/jira/browse/SPARK-32590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> `RowDataSourceScanExec` requires the full output instead of the scan output 
> after column pruning. However, in v2 code path, we don't have the full output 
> anymore so we just pass the pruned output. `RowDataSourceScanExec.fullOutput` 
> is actually meaningless so we should remove it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32590) Remove fullOutput from RowDataSourceScanExec

2020-08-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32590.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29415
[https://github.com/apache/spark/pull/29415]

> Remove fullOutput from RowDataSourceScanExec
> 
>
> Key: SPARK-32590
> URL: https://issues.apache.org/jira/browse/SPARK-32590
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.1.0
>
>
> `RowDataSourceScanExec` requires the full output instead of the scan output 
> after column pruning. However, in v2 code path, we don't have the full output 
> anymore so we just pass the pruned output. `RowDataSourceScanExec.fullOutput` 
> is actually meaningless so we should remove it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32615:

Description: 
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker java.util.NoSuchElementException: key not found: 
12 
at scala.collection.immutable.Map$Map1.apply(Map.scala:114) 
at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
milliseconds)
{code}
This issue occurs mainly because, during AQE, when a sub-plan is changed, the 
metrics update is overwritten. For example, in this UT the plan changes from a 
BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd action 
then tries to aggregate all the metrics of the execution, which causes a 
NoSuchElementException.

  was:
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
Uncaught exception in thread 
element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys 

[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32615:

Description: 
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
Uncaught exception in thread 
element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
milliseconds)
{code}
This issue occurs mainly because, during AQE, when a sub-plan is changed, the 
metrics update is overwritten. For example, in this UT the plan changes from a 
BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd action 
then tries to aggregate all the metrics of the execution, which causes a 
NoSuchElementException.

  was:
Reproduce Step
{code:java}
// code placeholder
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// code placeholder
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
Uncaught exception in thread 
element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 

[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-14 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177398#comment-17177398
 ] 

Takeshi Yamamuro edited comment on SPARK-32046 at 8/14/20, 7:28 AM:


Does this issue depend on ZP or Jupyter? If so, I think it is hard to reproduce 
this issue... The -Non-deterministic- current_timestamp expression can change 
its output if the cache is broken in that case. So, I think it is not a good 
idea for applications to depend on those kinds of values, either way.


was (Author: maropu):
Does this issue depend on ZP or Jupyter? If so, I think it is hard to reproduce 
this issue... The -Non-deterministic- exprs can change output if the cache is 
broken in that case. So, I think it is not a good idea for applications to 
depend on those kinds of values, either way.

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-14 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177562#comment-17177562
 ] 

Takeshi Yamamuro edited comment on SPARK-32046 at 8/14/20, 7:28 AM:


>> The question when does a query evaluation start and stop? Do mutual 
>> exclusive dataframes being processed consist of the same query evaluation? 
>> If yes, then current timestamp's behavior in spark shell is correct; 
>> however, as user, that would be extremely undesirable behavior. I would 
>> rather cache the current timestamp and call it again for a new time.

The evaluation of current_timestamp happens per dataframe just before invoking 
Spark jobs (more specifically, it is done at the optimization stage on the 
driver side).

>> Now if a query evaluation stops once it is executed and starts anew when 
>> another dataframe or action is called, then the behavior in shell and 
>> notebooks are incorrect. The notebooks are only correct for a few runs and 
>> then default to not changing.

In normal cases, I think the behaviour of spark-shell is correct. But I'm not 
sure what's going on in ZP/Jupyter. If you want to make it robust, I think it 
is better to use checkpoint instead of cache, though.


was (Author: maropu):
>> The question when does a query evaluation start and stop? Do mutual 
>> exclusive dataframes being processed consist of the same query evaluation? 
>> If yes, then current timestamp's behavior in spark shell is correct; 
>> however, as user, that would be extremely undesirable behavior. I would 
>> rather cache the current timestamp and call it again for a new time.

The evaluation of current_timestamp happens per dataframe just before invoking 
Spark jobs (more specifically, it is done at the optimization stage on the 
driver side).

>> Now if a query evaluation stops once it is executed and starts anew when 
>> another dataframe or action is called, then the behavior in shell and 
>> notebooks are incorrect. The notebooks are only correct for a few runs and 
>> then default to not changing.

In normal cases, I think the behaviour of spark-shell is correct. But I'm not 
sure what's going on in ZP/Jupyter. If you want to make it robust, I think it 
is better to use checkpoint instead of cache, though.

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd. 
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how 
> many times you run it, current_timestamp. However, in ZP or Jupyter I have 
> always received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-14 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177398#comment-17177398
 ] 

Takeshi Yamamuro edited comment on SPARK-32046 at 8/14/20, 7:25 AM:


Does this issue depend on ZP or Jupyter? If so, I think it is hard to reproduce 
this issue... The -Non-deterministic- exprs can change output if the cache is 
broken in that case. So, I think it is not a good idea for applications to 
depend on those kinds of values, either way.


was (Author: maropu):
Does this issue depend on ZP or Jupyter? If so, I think it is hard to reproduce 
this issue... Non-deterministic exprs can change output if the cache is broken 
in that case. So, I think it is not a good idea for applications to depend on 
those kinds of values, either way.

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 of the dataframes to be cached, skipping 
> the 3rd. However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in the shell versus Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how many 
> times you run current_timestamp. However, in ZP or Jupyter I have always 
> received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32046) current_timestamp called in a cache dataframe freezes the time for all future calls

2020-08-14 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177562#comment-17177562
 ] 

Takeshi Yamamuro commented on SPARK-32046:
--

>> The question is: when does a query evaluation start and stop? Do mutually 
>> exclusive dataframes being processed belong to the same query evaluation? 
>> If yes, then current_timestamp's behavior in spark-shell is correct; 
>> however, as a user, that would be extremely undesirable behavior. I would 
>> rather cache the current timestamp and call it again for a new time.

The evaluation of current_timestamp happens per dataframe just before invoking 
Spark jobs (more specifically, it's done at the optimization stage on the driver 
side).
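
For instance, a quick way to see this (a rough sketch, not code from the 
original report) is to look at the optimized plan, where the optimizer rule 
ComputeCurrentTime has already folded current_timestamp into a fixed literal:

{code:java}
import org.apache.spark.sql.functions.current_timestamp

// Sketch only: after optimization, the Project carries a timestamp literal
// rather than a CurrentTimestamp expression, so a cached plan keeps that
// captured value.
val df = spark.range(1).select(current_timestamp as "datetime")
println(df.queryExecution.optimizedPlan)
{code}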

>> Now if a query evaluation stops once it is executed and starts anew when 
>> another dataframe or action is called, then the behavior in the shell and 
>> notebooks is incorrect. The notebooks are only correct for a few runs and 
>> then default to not changing.

In normal cases, I think the behaviour of spark-shell is correct. But I'm not 
sure what's going on in ZP/Jupyter. If you want to make it robust, I think it's 
better to use checkpoint instead of cache though.
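
A rough sketch of that checkpoint variant (illustration only, assuming a 
writable checkpoint directory has been configured):

{code:java}
import org.apache.spark.sql.functions.current_timestamp

// Sketch only, assuming /tmp/spark-checkpoints is writable: checkpoint() is
// eager by default, so the captured timestamp is materialized to the checkpoint
// directory and survives cache eviction.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val df1 = spark.range(1).select(current_timestamp as "datetime").checkpoint()
df1.show(false)
{code}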

> current_timestamp called in a cache dataframe freezes the time for all future 
> calls
> ---
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.4, 3.0.0
>Reporter: Dustin Smith
>Priority: Minor
>  Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in 
> order to freeze that dataframe's time, the 3rd dataframe time and beyond 
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe 
> and the 2nd will differ in time but will become static on the 3rd usage and 
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 of the dataframes to be cached, skipping 
> the 3rd. However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times 
> displaying.
> Running the code in the shell versus Jupyter or Zeppelin (ZP) also produces 
> different results. In the shell, you only get 1 unique time no matter how many 
> times you run current_timestamp. However, in ZP or Jupyter I have always 
> received 2 unique times before it froze.
>  
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count 
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache 
> df3.count 
> df3.show(false){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32615:


Assignee: (was: Apache Spark)

> Fix AQE aggregateMetrics java.util.NoSuchElementException
> -
>
> Key: SPARK-32615
> URL: https://issues.apache.org/jira/browse/SPARK-32615
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> Reproduce Step
> {code:java}
> // code placeholder
> sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite 
> -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is 
> EmptyHashedRelationWithAllNullKeys"
> {code}
> {code:java}
> // code placeholder
> 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
> element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
> Uncaught exception in thread 
> element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
> 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
> scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
> scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
> scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
>  at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
> when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
> milliseconds)
> {code}
> This issue occurs mainly because, during AQE, when a sub-plan changes, the 
> metrics update is overwritten. For example, in this UT the plan changes from a 
> BroadcastHashJoinExec into a LocalTableScanExec, and in the onExecutionEnd 
> action it will try to aggregate all metrics from the execution, which will 
> cause a NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32615:


Assignee: Apache Spark

> Fix AQE aggregateMetrics java.util.NoSuchElementException
> -
>
> Key: SPARK-32615
> URL: https://issues.apache.org/jira/browse/SPARK-32615
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Apache Spark
>Priority: Minor
>
> Reproduce Step
> {code:java}
> // code placeholder
> sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite 
> -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is 
> EmptyHashedRelationWithAllNullKeys"
> {code}
> {code:java}
> // code placeholder
> 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
> element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
> Uncaught exception in thread 
> element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
> 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
> scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
> scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
> scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
>  at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
> when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
> milliseconds)
> {code}
> This issue occurs mainly because, during AQE, when a sub-plan changes, the 
> metrics update is overwritten. For example, in this UT the plan changes from a 
> BroadcastHashJoinExec into a LocalTableScanExec, and in the onExecutionEnd 
> action it will try to aggregate all metrics from the execution, which will 
> cause a NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177551#comment-17177551
 ] 

Apache Spark commented on SPARK-32615:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29431

> Fix AQE aggregateMetrics java.util.NoSuchElementException
> -
>
> Key: SPARK-32615
> URL: https://issues.apache.org/jira/browse/SPARK-32615
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> Reproduce Step
> {code:java}
> // code placeholder
> sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite 
> -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is 
> EmptyHashedRelationWithAllNullKeys"
> {code}
> {code:java}
> // code placeholder
> 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
> element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
> Uncaught exception in thread 
> element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
> 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
> scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
> scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
> scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
>  at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
> when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
> milliseconds)
> {code}
> This issue occurs mainly because, during AQE, when a sub-plan changes, the 
> metrics update is overwritten. For example, in this UT the plan changes from a 
> BroadcastHashJoinExec into a LocalTableScanExec, and in the onExecutionEnd 
> action it will try to aggregate all metrics from the execution, which will 
> cause a NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32615:

Summary: Fix AQE aggregateMetrics java.util.NoSuchElementException  (was: 
AQE aggregateMetrics java.util.NoSuchElementException)

> Fix AQE aggregateMetrics java.util.NoSuchElementException
> -
>
> Key: SPARK-32615
> URL: https://issues.apache.org/jira/browse/SPARK-32615
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> Reproduce Step
> {code:java}
> // code placeholder
> sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite 
> -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is 
> EmptyHashedRelationWithAllNullKeys"
> {code}
> {code:java}
> // code placeholder
> 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
> element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
> Uncaught exception in thread 
> element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
> 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
> scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
> scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
> scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
>  at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
> when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
> milliseconds)
> {code}
> This issue occurs mainly because, during AQE, when a sub-plan changes, the 
> metrics update is overwritten. For example, in this UT the plan changes from a 
> BroadcastHashJoinExec into a LocalTableScanExec, and in the onExecutionEnd 
> action it will try to aggregate all metrics from the execution, which will 
> cause a NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32615) AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32615:
---

 Summary: AQE aggregateMetrics java.util.NoSuchElementException
 Key: SPARK-32615
 URL: https://issues.apache.org/jira/browse/SPARK-32615
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


Reproduce Step
{code:java}
// code placeholder
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// code placeholder
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
Uncaught exception in thread 
element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
milliseconds)
{code}
This issue occurs mainly because, during AQE, when a sub-plan changes, the 
metrics update is overwritten. For example, in this UT the plan changes from a 
BroadcastHashJoinExec into a LocalTableScanExec, and in the onExecutionEnd 
action it will try to aggregate all metrics from the execution, which will cause 
a NoSuchElementException.
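
A standalone sketch of the failure mode (plain Scala collections, not the actual 
SQLAppStatusListener code): aggregating with Map.apply throws once an 
accumulator id belongs to a sub-plan that AQE has replaced, whereas a tolerant 
lookup simply skips it.

{code:java}
// Sketch only, not Spark code: metricTypes plays the role of the metric ids the
// listener knows about; id 12 arrived from the sub-plan that AQE replaced.
val metricTypes = Map(10L -> "sum", 11L -> "size")
val accumValues = Seq(10L -> 5L, 11L -> 7L, 12L -> 1L)

// metricTypes(12L) would throw java.util.NoSuchElementException: key not found: 12
val aggregated = accumValues.flatMap { case (id, value) =>
  metricTypes.get(id).map(metricType => (id, metricType, value))
}
// keeps (10, sum, 5) and (11, size, 7); the unknown id is dropped
{code}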



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32609) Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32609:


Assignee: Apache Spark

> Incorrect exchange reuse with DataSourceV2
> --
>
> Key: SPARK-32609
> URL: https://issues.apache.org/jira/browse/SPARK-32609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Mingjia Liu
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
>  
> {code:java}
> spark.conf.set("spark.sql.exchange.reuse","true")
> df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load()
> df.createOrReplaceTempView("date_dim")
> 
> df = spark.sql(""" 
> WITH t1 AS (
> SELECT 
> d_year, d_month_seq
> FROM (
> SELECT t1.d_year , t2.d_month_seq  
> FROM 
> date_dim t1
> cross join
> date_dim t2
> where t1.d_day_name = "Monday" and t1.d_fy_year > 2000
> and t2.d_day_name = "Monday" and t2.d_fy_year > 2000
> )
> GROUP BY d_year, d_month_seq)
>
>  SELECT
> prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq
>  FROM t1 curr_yr cross join t1 prev_yr
>  WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1
>  ORDER BY d_month_seq
>  LIMIT 100
>  
>  """)
> df.explain()
> df.show(){code}
>  
> The repro query:
> A. defines a temp table t1
> B. cross joins t1 (year 2002) with t1 (year 2001)
> With exchange reuse enabled, the plan incorrectly "decides" to re-use the 
> persisted shuffle writes of A, filtered on year 2002, for year 2001.
> {code:java}
> == Physical Plan ==
> TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS 
> FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L])
> +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS 
> year#24367L, d_month_seq#24371L]
>+- CartesianProduct
>   :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :  +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200)
>   : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :+- BroadcastNestedLoopJoin BuildRight, Cross
>   :   :- *(1) Project [d_year#23551L]
>   :   :  +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] 
> (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), 
> isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   :   +- BroadcastExchange IdentityBroadcastMode
>   :  +- *(2) Project [d_month_seq#24371L]
>   : +- *(2) ScanV2 
> BigQueryDataSourceV2[d_month_seq#24371L] (Filters: 
> [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), 
> isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], 
> functions=[])
>  +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange 
> hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code}
>  
>  
> And the result is obviously incorrect because prev_year should be 2001
> {code:java}
> +-++---+
> |prev_year|year|d_month_seq|
> +-++---+
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> +-++---+
> only showing top 20 rows
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32609) Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177528#comment-17177528
 ] 

Apache Spark commented on SPARK-32609:
--

User 'mingjialiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29430

> Incorrect exchange reuse with DataSourceV2
> --
>
> Key: SPARK-32609
> URL: https://issues.apache.org/jira/browse/SPARK-32609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Mingjia Liu
>Priority: Major
>  Labels: correctness
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
>  
> {code:java}
> spark.conf.set("spark.sql.exchange.reuse","true")
> df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load()
> df.createOrReplaceTempView("date_dim")
> 
> df = spark.sql(""" 
> WITH t1 AS (
> SELECT 
> d_year, d_month_seq
> FROM (
> SELECT t1.d_year , t2.d_month_seq  
> FROM 
> date_dim t1
> cross join
> date_dim t2
> where t1.d_day_name = "Monday" and t1.d_fy_year > 2000
> and t2.d_day_name = "Monday" and t2.d_fy_year > 2000
> )
> GROUP BY d_year, d_month_seq)
>
>  SELECT
> prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq
>  FROM t1 curr_yr cross join t1 prev_yr
>  WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1
>  ORDER BY d_month_seq
>  LIMIT 100
>  
>  """)
> df.explain()
> df.show(){code}
>  
> The repro query:
> A. defines a temp table t1
> B. cross joins t1 (year 2002) with t1 (year 2001)
> With exchange reuse enabled, the plan incorrectly "decides" to re-use the 
> persisted shuffle writes of A, filtered on year 2002, for year 2001.
> {code:java}
> == Physical Plan ==
> TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS 
> FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L])
> +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS 
> year#24367L, d_month_seq#24371L]
>+- CartesianProduct
>   :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :  +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200)
>   : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :+- BroadcastNestedLoopJoin BuildRight, Cross
>   :   :- *(1) Project [d_year#23551L]
>   :   :  +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] 
> (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), 
> isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   :   +- BroadcastExchange IdentityBroadcastMode
>   :  +- *(2) Project [d_month_seq#24371L]
>   : +- *(2) ScanV2 
> BigQueryDataSourceV2[d_month_seq#24371L] (Filters: 
> [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), 
> isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], 
> functions=[])
>  +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange 
> hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code}
>  
>  
> And the result is obviously incorrect because prev_year should be 2001
> {code:java}
> +-++---+
> |prev_year|year|d_month_seq|
> +-++---+
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> +-++---+
> only showing top 20 rows
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32609) Incorrect exchange reuse with DataSourceV2

2020-08-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32609:


Assignee: (was: Apache Spark)

> Incorrect exchange reuse with DataSourceV2
> --
>
> Key: SPARK-32609
> URL: https://issues.apache.org/jira/browse/SPARK-32609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Mingjia Liu
>Priority: Major
>  Labels: correctness
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
>  
> {code:java}
> spark.conf.set("spark.sql.exchange.reuse","true")
> df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").option("table",'tpcds_1G.date_dim').load()
> df.createOrReplaceTempView("date_dim")
> 
> df = spark.sql(""" 
> WITH t1 AS (
> SELECT 
> d_year, d_month_seq
> FROM (
> SELECT t1.d_year , t2.d_month_seq  
> FROM 
> date_dim t1
> cross join
> date_dim t2
> where t1.d_day_name = "Monday" and t1.d_fy_year > 2000
> and t2.d_day_name = "Monday" and t2.d_fy_year > 2000
> )
> GROUP BY d_year, d_month_seq)
>
>  SELECT
> prev_yr.d_year AS prev_year, curr_yr.d_year AS year, curr_yr.d_month_seq
>  FROM t1 curr_yr cross join t1 prev_yr
>  WHERE curr_yr.d_year=2002 AND prev_yr.d_year=2002-1
>  ORDER BY d_month_seq
>  LIMIT 100
>  
>  """)
> df.explain()
> df.show(){code}
>  
> the repro query :
> A. defines a temp table t1  
> B. cross join t1 (year 2002)  and  t2 (year 2001)
> With reuse exchange enabled, the plan incorrectly "decides" to re-use 
> persisted shuffle writes of A filtering on year 2002 , for year 2001.
> {code:java}
> == Physical Plan ==
> TakeOrderedAndProject(limit=100, orderBy=[d_month_seq#24371L ASC NULLS 
> FIRST], output=[prev_year#24366L,year#24367L,d_month_seq#24371L])
> +- *(9) Project [d_year#24402L AS prev_year#24366L, d_year#23551L AS 
> year#24367L, d_month_seq#24371L]
>+- CartesianProduct
>   :- *(4) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :  +- Exchange hashpartitioning(d_year#23551L, d_month_seq#24371L, 200)
>   : +- *(3) HashAggregate(keys=[d_year#23551L, d_month_seq#24371L], 
> functions=[])
>   :+- BroadcastNestedLoopJoin BuildRight, Cross
>   :   :- *(1) Project [d_year#23551L]
>   :   :  +- *(1) ScanV2 BigQueryDataSourceV2[d_year#23551L] 
> (Filters: [isnotnull(d_day_name#23559), (d_day_name#23559 = Monday), 
> isnotnull(d_fy_year#23556L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   :   +- BroadcastExchange IdentityBroadcastMode
>   :  +- *(2) Project [d_month_seq#24371L]
>   : +- *(2) ScanV2 
> BigQueryDataSourceV2[d_month_seq#24371L] (Filters: 
> [isnotnull(d_day_name#24382), (d_day_name#24382 = Monday), 
> isnotnull(d_fy_year#24379L), (d_fy_yea..., Options: 
> [table=tpcds_1G.date_dim,paths=[]])
>   +- *(8) HashAggregate(keys=[d_year#24402L, d_month_seq#24455L], 
> functions=[])
>  +- ReusedExchange [d_year#24402L, d_month_seq#24455L], Exchange 
> hashpartitioning(d_year#23551L, d_month_seq#24371L, 200){code}
>  
>  
> And the result is obviously incorrect because prev_year should be 2001
> {code:java}
> +-++---+
> |prev_year|year|d_month_seq|
> +-++---+
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> | 2002|2002|   1212|
> +-++---+
> only showing top 20 rows
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org