[jira] [Created] (SPARK-32364) `path` argument of DataFrame.load/save should override the existing options

2020-07-19 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-32364:
-

 Summary: `path` argument of DataFrame.load/save should override 
the existing options
 Key: SPARK-32364
 URL: https://issues.apache.org/jira/browse/SPARK-32364
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 2.4.6, 2.3.4, 2.2.3, 2.1.3, 2.0.2
Reporter: Dongjoon Hyun


{code}
spark.read
  .option("paTh", "1")
  .option("PATH", "2")
  .option("Path", "3")
  .option("patH", "4")
  .parquet("5")
{code}

{code}
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/Users/dongjoon/APACHE/spark-release/spark-3.0.0-bin-hadoop3.2/1;
{code}
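
A minimal spark-shell sketch of the precedence this ticket asks for (the /tmp paths are illustrative): the path passed to the load call should win over any previously set, case-insensitive `path` option, whereas today the existing option wins, as the error above shows.

{code:scala}
// Write a small dataset to a real location (illustrative path).
spark.range(10).write.mode("overwrite").parquet("/tmp/real-data")

val df = spark.read
  .option("path", "/tmp/stale-path")  // pre-existing option; in affected versions this wins
  .parquet("/tmp/real-data")          // proposed: this argument should take precedence
df.show()                             // today: fails with "Path does not exist: .../stale-path"
{code}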






[jira] [Commented] (SPARK-32310) ML params default value parity

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160907#comment-17160907
 ] 

Apache Spark commented on SPARK-32310:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29159

> ML params default value parity
> --
>
> Key: SPARK-32310
> URL: https://issues.apache.org/jira/browse/SPARK-32310
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Make sure ML has the same default param values between estimator and its 
> corresponding transformer, and also between Scala and Python.
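
A spark-shell sketch of the kind of parity being checked, using LogisticRegression as an arbitrary example (the specific params and values are only illustrative): the estimator's defaults, the fitted model's defaults, and the PySpark wrapper's defaults should all agree.

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val lr = new LogisticRegression()
println(lr.getMaxIter)   // estimator default, expected 100
println(lr.getRegParam)  // estimator default, expected 0.0

// The fitted model shares the same Params, so it should report the same defaults
// as the estimator (and as pyspark.ml.classification.LogisticRegression).
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
)).toDF("label", "features")
val model = lr.fit(training)
println(model.getMaxIter)
println(model.getRegParam)
{code}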






[jira] [Commented] (SPARK-32310) ML params default value parity

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160905#comment-17160905
 ] 

Apache Spark commented on SPARK-32310:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29159

> ML params default value parity
> --
>
> Key: SPARK-32310
> URL: https://issues.apache.org/jira/browse/SPARK-32310
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Make sure ML has the same default param values between estimator and its 
> corresponding transformer, and also between Scala and Python.






[jira] [Commented] (SPARK-32363) Flaky pip installation test in Jenkins

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160888#comment-17160888
 ] 

Apache Spark commented on SPARK-32363:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29117

> Flaky pip installation test in Jenkins
> --
>
> Key: SPARK-32363
> URL: https://issues.apache.org/jira/browse/SPARK-32363
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently pip packaging test is flaky in Jenkins:
> {code}
> Installing collected packages: py4j, pyspark
>   Attempting uninstall: py4j
> Found existing installation: py4j 0.10.9
> Uninstalling py4j-0.10.9:
>   Successfully uninstalled py4j-0.10.9
>   Attempting uninstall: pyspark
> Found existing installation: pyspark 3.1.0.dev0
> ERROR: Exception:
> Traceback (most recent call last):
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/base_command.py",
>  line 188, in _main
> status = self.run(options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/req_command.py",
>  line 185, in wrapper
> return func(self, options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/commands/install.py",
>  line 407, in run
> use_user_site=options.use_user_site,
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/__init__.py",
>  line 64, in install_given_reqs
> auto_confirm=True
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_install.py",
>  line 675, in uninstall
> uninstalled_pathset = UninstallPathSet.from_dist(dist)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py",
>  line 545, in from_dist
> link_pointer, dist.project_name, dist.location)
> AssertionError: Egg-link 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python does 
> not match installed location of pyspark (at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/python)
> Cleaning up temporary directory - /tmp/tmp.GX6lHKLHZK
> {code}






[jira] [Commented] (SPARK-32363) Flaky pip installation test in Jenkins

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160887#comment-17160887
 ] 

Apache Spark commented on SPARK-32363:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29117

> Flaky pip installation test in Jenkins
> --
>
> Key: SPARK-32363
> URL: https://issues.apache.org/jira/browse/SPARK-32363
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently pip packaging test is flaky in Jenkins:
> {code}
> Installing collected packages: py4j, pyspark
>   Attempting uninstall: py4j
> Found existing installation: py4j 0.10.9
> Uninstalling py4j-0.10.9:
>   Successfully uninstalled py4j-0.10.9
>   Attempting uninstall: pyspark
> Found existing installation: pyspark 3.1.0.dev0
> ERROR: Exception:
> Traceback (most recent call last):
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/base_command.py",
>  line 188, in _main
> status = self.run(options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/req_command.py",
>  line 185, in wrapper
> return func(self, options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/commands/install.py",
>  line 407, in run
> use_user_site=options.use_user_site,
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/__init__.py",
>  line 64, in install_given_reqs
> auto_confirm=True
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_install.py",
>  line 675, in uninstall
> uninstalled_pathset = UninstallPathSet.from_dist(dist)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py",
>  line 545, in from_dist
> link_pointer, dist.project_name, dist.location)
> AssertionError: Egg-link 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python does 
> not match installed location of pyspark (at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/python)
> Cleaning up temporary directory - /tmp/tmp.GX6lHKLHZK
> {code}






[jira] [Assigned] (SPARK-32363) Flaky pip installation test in Jenkins

2020-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32363:


Assignee: (was: Apache Spark)

> Flaky pip installation test in Jenkins
> --
>
> Key: SPARK-32363
> URL: https://issues.apache.org/jira/browse/SPARK-32363
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently pip packaging test is flaky in Jenkins:
> {code}
> Installing collected packages: py4j, pyspark
>   Attempting uninstall: py4j
> Found existing installation: py4j 0.10.9
> Uninstalling py4j-0.10.9:
>   Successfully uninstalled py4j-0.10.9
>   Attempting uninstall: pyspark
> Found existing installation: pyspark 3.1.0.dev0
> ERROR: Exception:
> Traceback (most recent call last):
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/base_command.py",
>  line 188, in _main
> status = self.run(options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/req_command.py",
>  line 185, in wrapper
> return func(self, options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/commands/install.py",
>  line 407, in run
> use_user_site=options.use_user_site,
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/__init__.py",
>  line 64, in install_given_reqs
> auto_confirm=True
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_install.py",
>  line 675, in uninstall
> uninstalled_pathset = UninstallPathSet.from_dist(dist)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py",
>  line 545, in from_dist
> link_pointer, dist.project_name, dist.location)
> AssertionError: Egg-link 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python does 
> not match installed location of pyspark (at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/python)
> Cleaning up temporary directory - /tmp/tmp.GX6lHKLHZK
> {code}






[jira] [Assigned] (SPARK-32363) Flaky pip installation test in Jenkins

2020-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32363:


Assignee: Apache Spark

> Flaky pip installation test in Jenkins
> --
>
> Key: SPARK-32363
> URL: https://issues.apache.org/jira/browse/SPARK-32363
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently pip packaging test is flaky in Jenkins:
> {code}
> Installing collected packages: py4j, pyspark
>   Attempting uninstall: py4j
> Found existing installation: py4j 0.10.9
> Uninstalling py4j-0.10.9:
>   Successfully uninstalled py4j-0.10.9
>   Attempting uninstall: pyspark
> Found existing installation: pyspark 3.1.0.dev0
> ERROR: Exception:
> Traceback (most recent call last):
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/base_command.py",
>  line 188, in _main
> status = self.run(options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/req_command.py",
>  line 185, in wrapper
> return func(self, options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/commands/install.py",
>  line 407, in run
> use_user_site=options.use_user_site,
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/__init__.py",
>  line 64, in install_given_reqs
> auto_confirm=True
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_install.py",
>  line 675, in uninstall
> uninstalled_pathset = UninstallPathSet.from_dist(dist)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py",
>  line 545, in from_dist
> link_pointer, dist.project_name, dist.location)
> AssertionError: Egg-link 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python does 
> not match installed location of pyspark (at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/python)
> Cleaning up temporary directory - /tmp/tmp.GX6lHKLHZK
> {code}






[jira] [Created] (SPARK-32363) Flaky pip installation test in Jenkins

2020-07-19 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32363:


 Summary: Flaky pip installation test in Jenkins
 Key: SPARK-32363
 URL: https://issues.apache.org/jira/browse/SPARK-32363
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


Currently pip packaging test is flaky in Jenkins:

{code}
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
Found existing installation: py4j 0.10.9
Uninstalling py4j-0.10.9:
  Successfully uninstalled py4j-0.10.9
  Attempting uninstall: pyspark
Found existing installation: pyspark 3.1.0.dev0
ERROR: Exception:
Traceback (most recent call last):
  File 
"/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/base_command.py",
 line 188, in _main
status = self.run(options, args)
  File 
"/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/req_command.py",
 line 185, in wrapper
return func(self, options, args)
  File 
"/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/commands/install.py",
 line 407, in run
use_user_site=options.use_user_site,
  File 
"/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/__init__.py",
 line 64, in install_given_reqs
auto_confirm=True
  File 
"/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_install.py",
 line 675, in uninstall
uninstalled_pathset = UninstallPathSet.from_dist(dist)
  File 
"/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py",
 line 545, in from_dist
link_pointer, dist.project_name, dist.location)
AssertionError: Egg-link 
/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python does 
not match installed location of pyspark (at 
/home/jenkins/workspace/SparkPullRequestBuilder@2/python)
Cleaning up temporary directory - /tmp/tmp.GX6lHKLHZK
{code}






[jira] [Resolved] (SPARK-20629) Copy shuffle data when nodes are being shut down

2020-07-19 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-20629.
--
Fix Version/s: 3.1.0
 Assignee: Holden Karau
   Resolution: Fixed

> Copy shuffle data when nodes are being shut down
> 
>
> Key: SPARK-20629
> URL: https://issues.apache.org/jira/browse/SPARK-20629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>
> We decided not to do this for YARN, but for Kubernetes and similar systems 
> nodes may be shut down entirely without the ability to keep an 
> AuxiliaryService around.






[jira] [Updated] (SPARK-32362) AdaptiveQueryExecSuite misses verifying AE results

2020-07-19 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32362:
---
Description: 
{code}
QueryTest.sameRows(result.toSeq, df.collect().toSeq)
{code}
Even when the results are different, the test does not fail.
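
A minimal fix sketch (not necessarily what the linked PR does), assuming ScalaTest's `fail` is in scope: `QueryTest.sameRows` returns an `Option[String]` describing any mismatch, so the suite should act on that return value instead of discarding it.

{code:scala}
QueryTest.sameRows(result.toSeq, df.collect().toSeq).foreach { mismatch =>
  // sameRows yields Some(errorMessage) when the row sets differ; failing here
  // makes the suite actually verify the adaptive execution result.
  fail(s"AdaptiveQueryExec result differs from the expected rows:\n$mismatch")
}
{code}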

> AdaptiveQueryExecSuite misses verifying AE results
> --
>
> Key: SPARK-32362
> URL: https://issues.apache.org/jira/browse/SPARK-32362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> {code}
> QueryTest.sameRows(result.toSeq, df.collect().toSeq)
> {code}
> Even when the results are different, the test does not fail.






[jira] [Commented] (SPARK-32362) AdaptiveQueryExecSuite misses verifying AE results

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160880#comment-17160880
 ] 

Apache Spark commented on SPARK-32362:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/29158

> AdaptiveQueryExecSuite misses verifying AE results
> --
>
> Key: SPARK-32362
> URL: https://issues.apache.org/jira/browse/SPARK-32362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> {code}
> QueryTest.sameRows(result.toSeq, df.collect().toSeq)
> {code}
> Even when the results are different, the test does not fail.






[jira] [Assigned] (SPARK-32362) AdaptiveQueryExecSuite misses verifying AE results

2020-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32362:


Assignee: (was: Apache Spark)

> AdaptiveQueryExecSuite misses verifying AE results
> --
>
> Key: SPARK-32362
> URL: https://issues.apache.org/jira/browse/SPARK-32362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>







[jira] [Assigned] (SPARK-32362) AdaptiveQueryExecSuite misses verifying AE results

2020-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32362:


Assignee: Apache Spark

> AdaptiveQueryExecSuite misses verifying AE results
> --
>
> Key: SPARK-32362
> URL: https://issues.apache.org/jira/browse/SPARK-32362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-32362) AdaptiveQueryExecSuite misses verifying AE results

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160879#comment-17160879
 ] 

Apache Spark commented on SPARK-32362:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/29158

> AdaptiveQueryExecSuite misses verifying AE results
> --
>
> Key: SPARK-32362
> URL: https://issues.apache.org/jira/browse/SPARK-32362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>







[jira] [Updated] (SPARK-32362) AdaptiveQueryExecSuite misses verifying AE results

2020-07-19 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32362:
---
Summary: AdaptiveQueryExecSuite misses verifying AE results  (was: 
AdaptiveQueryExecSuite has problem)

> AdaptiveQueryExecSuite misses verifying AE results
> --
>
> Key: SPARK-32362
> URL: https://issues.apache.org/jira/browse/SPARK-32362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>







[jira] [Created] (SPARK-32362) AdaptiveQueryExecSuite has problem

2020-07-19 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32362:
--

 Summary: AdaptiveQueryExecSuite has problem
 Key: SPARK-32362
 URL: https://issues.apache.org/jira/browse/SPARK-32362
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin









[jira] [Commented] (SPARK-32344) Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160865#comment-17160865
 ] 

Apache Spark commented on SPARK-32344:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29157

> Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates
> 
>
> Key: SPARK-32344
> URL: https://issues.apache.org/jira/browse/SPARK-32344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> {code}
> scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
> ...
> Caused by: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: false#37
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)
> {code}






[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160859#comment-17160859
 ] 

Lantao Jin commented on SPARK-32347:


Duplicate of SPARK-32237.

> BROADCAST hint makes a weird message that "column can't be resolved" (it was 
> OK in Spark 2.4)
> -
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, jupyter notebook, spark launched in 
> local[4] mode, but with Standalone cluster it also fails the same way.
>  
>  
>Reporter: Ihor Bobak
>Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>
> The bug is very easily reproduced: run the same code below in Spark 2.4.3 and 
> in 3.0.0.
> The SQL parser raises an invalid error message in 3.0.0, although the SQL 
> statement seems to be fine and it works in Spark 2.4.3.
> {code:python}
> import pandas as pd
> pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
> pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", 
> "BuyerName"])
> df_sales = spark.createDataFrame(pdf_sales)
> df_buyers = spark.createDataFrame(pdf_buyers)
> df_sales.createOrReplaceTempView("df_sales")
> df_buyers.createOrReplaceTempView("df_buyers")
> spark.sql("""
> with b as (
> select /*+ BROADCAST(df_buyers) */
> BuyerID, BuyerName 
> from df_buyers
> )
> select 
> b.BuyerID,
> b.BuyerName,
> s.Qty
> from df_sales s
> inner join b on s.BuyerID =  b.BuyerID
> """).toPandas()
> {code}
> The (wrong) error message:
> ---
> AnalysisException Traceback (most recent call last)
>  in 
>  22 from df_sales s
>  23 inner join b on s.BuyerID =  b.BuyerID
> ---> 24 """).toPandas()
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in 
> sql(self, sqlQuery)
> 644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, 
> f2=u'row3')]
> 645 """
> --> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), 
> self._wrapped)
> 647 
> 648 @since(2.0)
> /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1303 answer = self.gateway_client.send_command(command)
>1304 return_value = get_return_value(
> -> 1305 answer, self.gateway_client, self.target_id, self.name)
>1306 
>1307 for temp_arg in temp_args:
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, 
> **kw)
> 135 # Hide where the exception came from that shows a 
> non-Pythonic
> 136 # JVM exception message.
> --> 137 raise_from(converted)
> 138 else:
> 139 raise
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in 
> raise_from(e)
> AnalysisException: cannot resolve '`s.BuyerID`' given input columns: 
> [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
> 'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
> +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
>:- SubqueryAlias s
>:  +- SubqueryAlias df_sales
>: +- LogicalRDD [BuyerID#23L, Qty#24L], false
>+- SubqueryAlias b
>   +- Project [BuyerID#27L, BuyerName#28]
>  +- SubqueryAlias df_buyers
> +- LogicalRDD [BuyerID#27L, BuyerName#28], false






[jira] [Commented] (SPARK-32257) [SQL Parser] Report Error for invalid usage of SET command

2020-07-19 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160854#comment-17160854
 ] 

jiaan.geng commented on SPARK-32257:


[~maropu] Thanks!

> [SQL Parser] Report Error for invalid usage of SET command
> --
>
> Key: SPARK-32257
> URL: https://issues.apache.org/jira/browse/SPARK-32257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> {code:java}
> SET spark.sql.ansi.enabled true{code}
> The above SQL command does not change the conf value; it just tries to 
> display the value of a conf literally named "spark.sql.ansi.enabled true".
> We can disallow spaces in the conf name and issue a user-friendly error 
> instead. In the error message, we should tell users the workaround of quoting 
> the key if they still need to specify a conf name that contains a space. 
>  
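
A spark-shell sketch of the difference, run against a local session: the first statement is treated as a lookup of a key literally named "spark.sql.ansi.enabled true", while the second actually sets the conf.

{code:scala}
// Today: parsed as a (confusing) lookup of the key "spark.sql.ansi.enabled true".
spark.sql("SET spark.sql.ansi.enabled true").show(false)

// Actually sets the configuration.
spark.sql("SET spark.sql.ansi.enabled=true")
println(spark.conf.get("spark.sql.ansi.enabled"))  // true
{code}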






[jira] [Assigned] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2020-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29157:


Assignee: Maciej Szymkiewicz

> DataSourceV2: Add DataFrameWriterV2 to Python API
> -
>
> Key: SPARK-29157
> URL: https://issues.apache.org/jira/browse/SPARK-29157
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> After SPARK-28612 is committed, we need to add the corresponding PySpark API.
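
For reference, a hedged sketch of the Scala DataFrameWriterV2 API introduced by SPARK-28612 that this ticket mirrors in PySpark; "catalog.db.events" is a hypothetical table in a configured v2 catalog.

{code:scala}
val df = spark.range(5).withColumnRenamed("id", "event_id")

// Create or replace the table, then append to it through the v2 writer.
df.writeTo("catalog.db.events").using("parquet").createOrReplace()
df.writeTo("catalog.db.events").append()
{code}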






[jira] [Resolved] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2020-07-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29157.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27331
[https://github.com/apache/spark/pull/27331]

> DataSourceV2: Add DataFrameWriterV2 to Python API
> -
>
> Key: SPARK-29157
> URL: https://issues.apache.org/jira/browse/SPARK-29157
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
> Fix For: 3.1.0
>
>
> After SPARK-28612 is committed, we need to add the corresponding PySpark API.






[jira] [Updated] (SPARK-32361) Support collapse project with case Aggregate(Project)

2020-07-19 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-32361:

Summary: Support collapse project with case Aggregate(Project)  (was: 
Support collapse project with case Aggregate(_, _, Project))

> Support collapse project with case Aggregate(Project)
> -
>
> Key: SPARK-32361
> URL: https://issues.apache.org/jira/browse/SPARK-32361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>







[jira] [Created] (SPARK-32361) Support collapse project with case Aggregate(_, _, Project)

2020-07-19 Thread ulysses you (Jira)
ulysses you created SPARK-32361:
---

 Summary: Support collapse project with case Aggregate(_, _, 
Project)
 Key: SPARK-32361
 URL: https://issues.apache.org/jira/browse/SPARK-32361
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you









[jira] [Commented] (SPARK-32360) Add MaxMinBy to support eliminate sorts

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160841#comment-17160841
 ] 

Apache Spark commented on SPARK-32360:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29142

> Add MaxMinBy to support eliminate sorts
> ---
>
> Key: SPARK-32360
> URL: https://issues.apache.org/jira/browse/SPARK-32360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>







[jira] [Commented] (SPARK-32360) Add MaxMinBy to support eliminate sorts

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160839#comment-17160839
 ] 

Apache Spark commented on SPARK-32360:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29142

> Add MaxMinBy to support eliminate sorts
> ---
>
> Key: SPARK-32360
> URL: https://issues.apache.org/jira/browse/SPARK-32360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>







[jira] [Assigned] (SPARK-32360) Add MaxMinBy to support eliminate sorts

2020-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32360:


Assignee: (was: Apache Spark)

> Add MaxMinBy to support eliminate sorts
> ---
>
> Key: SPARK-32360
> URL: https://issues.apache.org/jira/browse/SPARK-32360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>







[jira] [Assigned] (SPARK-32360) Add MaxMinBy to support eliminate sorts

2020-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32360:


Assignee: Apache Spark

> Add MaxMinBy to support eliminate sorts
> ---
>
> Key: SPARK-32360
> URL: https://issues.apache.org/jira/browse/SPARK-32360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Created] (SPARK-32360) Add MaxMinBy to support eliminate sorts

2020-07-19 Thread ulysses you (Jira)
ulysses you created SPARK-32360:
---

 Summary: Add MaxMinBy to support eliminate sorts
 Key: SPARK-32360
 URL: https://issues.apache.org/jira/browse/SPARK-32360
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you









[jira] [Commented] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up

2020-07-19 Thread Gerard Alexander (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160831#comment-17160831
 ] 

Gerard Alexander commented on SPARK-24156:
--

Please see 
[https://stackoverflow.com/questions/62915922/spark-structured-streaming-wont-pull-the-final-batch-from-kafka]
 - has the issue possibly crept back in?

> Enable no-data micro batches for more eager streaming state clean up 
> -
>
> Key: SPARK-24156
> URL: https://issues.apache.org/jira/browse/SPARK-24156
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, MicroBatchExecution in Structured Streaming runs batches only when 
> there is new data to process. This is sensible in most cases as we don't want 
> to unnecessarily use resources when there is nothing new to process. However, 
> in some cases of stateful streaming queries, this delays state clean up as 
> well as clean-up based output. For example, consider a streaming aggregation 
> query with watermark-based state cleanup. The watermark is updated after 
> every batch with new data completes. The updated value is used in the next 
> batch to clean up state, and output finalized aggregates in append mode. 
> However, if there is no data, then the next batch does not occur, and 
> cleanup/output gets delayed unnecessarily. This is true for all stateful 
> streaming operators - aggregation, deduplication, joins, mapGroupsWithState
> This issue tracks the work to enable no-data batches in MicroBatchExecution. 
> The major challenge is that all the tests of relevant stateful operations add 
> dummy data to force another batch for testing the state cleanup. So a lot of 
> the tests are going to be changed. So my plan is to enable no-data batches 
> for different stateful operators one at a time.
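
A minimal spark-shell sketch of the scenario described above, using the built-in rate source and illustrative intervals: a watermarked aggregation in append mode, whose finalized windows previously could only be emitted once another batch with new data ran.

{code:scala}
import org.apache.spark.sql.functions._

val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

val counts = events
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window(col("timestamp"), "10 seconds"))
  .count()

// In append mode a window is only emitted after the watermark passes its end;
// without no-data batches that emission waits for the next batch with new data.
val query = counts.writeStream.outputMode("append").format("console").start()
{code}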






[jira] [Comment Edited] (SPARK-24815) Structured Streaming should support dynamic allocation

2020-07-19 Thread Gerard Alexander (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160820#comment-17160820
 ] 

Gerard Alexander edited comment on SPARK-24815 at 7/19/20, 9:23 PM:


Thanks, but I get that; it is just hard to follow here. That article states SSS 
is being run with dynamic resources. But if I click through here and read the 
comments, I get the impression there is no batch algorithm support for SSS 
either. What is the status? I'm on holidays, so I have less access to quality 
resources to check this out.


was (Author: thebluephantom):
Thanks, but I get that; it is just hard to follow here. That article states SSS 
is being run with dynamic resources. But if I click through here and read the 
comments, I get the impression there is no batch algorithm support for SSS 
either. What is the status?

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled
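
A minimal sketch of the setup being discussed (the app name and the toy query are illustrative): with spark.dynamicAllocation.enabled=true, the streaming query below is governed by the batch backlog/idle-timeout algorithm today. Classic dynamic allocation typically also needs the external shuffle service.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sss-dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")  // batch algorithm kicks in
  .config("spark.shuffle.service.enabled", "true")    // usually required alongside it
  .getOrCreate()

// A long-running structured streaming query scaled by the batch heuristics.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()
query.awaitTermination()
{code}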






[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2020-07-19 Thread Gerard Alexander (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160820#comment-17160820
 ] 

Gerard Alexander commented on SPARK-24815:
--

Thanks, but I get that; it is just hard to follow here. That article states SSS 
is being run with dynamic resources. But if I click through here and read the 
comments, I get the impression there is no batch algorithm support for SSS 
either. What is the status?

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled






[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2020-07-19 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160818#comment-17160818
 ] 

Jungtaek Lim commented on SPARK-24815:
--

I don't think the goal of the issue is only about totally idle vs. a running 
query. The goal is to smoothly reduce the resources needed when less input data 
is being ingested and bring up more when more data is being ingested. If the 
input data comes from a real product, it most probably has peak times during 
the day, or expected/unexpected peaks around certain events.

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled






[jira] [Comment Edited] (SPARK-24815) Structured Streaming should support dynamic allocation

2020-07-19 Thread Gerard Alexander (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160816#comment-17160816
 ] 

Gerard Alexander edited comment on SPARK-24815 at 7/19/20, 8:45 PM:


When I read this, I am not sure what is being said. This 
[https://dzone.com/articles/spark-dynamic-allocation] states and shows that 
Spark Structured Streaming does have the capability of relinquishing or 
acquiring new resources up to the maximum setting configured for the job. Can 
you clarify, please? The issue seems to be that the batch approach is applied 
to a non-batch situation. Maybe it was Spark Streaming and not Spark Structured 
Streaming, but the descriptions are unclear. 


was (Author: thebluephantom):
When I read this, I am not sure what is being said. This 
[https://dzone.com/articles/spark-dynamic-allocation] states and shows that 
Spark Structured Streaming does have the capability of relinquishing or 
acquiring new resources up to the maximum setting configured for the job. Can 
you clarify, please? The issue seems to be that the batch approach is applied 
to a non-batch situation.

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled






[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2020-07-19 Thread Gerard Alexander (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160816#comment-17160816
 ] 

Gerard Alexander commented on SPARK-24815:
--

When I read this, I am not sure what is being said. This 
[https://dzone.com/articles/spark-dynamic-allocation] states and shows that 
Spark Structured Streaming does have the capability of relinquishing or 
acquiring new resources up to the maximum setting configured for the job. Can 
you clarify, please? The issue seems to be that the batch approach is applied 
to a non-batch situation.

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled






[jira] [Created] (SPARK-32359) Implement max_error metric evaluator for spark regression mllib

2020-07-19 Thread Sayed Mohammad Hossein Torabi (Jira)
Sayed Mohammad Hossein Torabi created SPARK-32359:
-

 Summary: Implement max_error metric evaluator for spark regression 
mllib
 Key: SPARK-32359
 URL: https://issues.apache.org/jira/browse/SPARK-32359
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 3.0.0
Reporter: Sayed Mohammad Hossein Torabi









[jira] [Resolved] (SPARK-32276) Remove redundant sorts before repartition nodes

2020-07-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32276.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29089
[https://github.com/apache/spark/pull/29089]

> Remove redundant sorts before repartition nodes
> ---
>
> Key: SPARK-32276
> URL: https://issues.apache.org/jira/browse/SPARK-32276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.1.0
>
>
> I think our {{EliminateSorts}} rule can be extended further to remove sorts 
> before repartition, repartitionByExpression and coalesce nodes. Independently 
> of whether we do a shuffle or not, each repartition operation will change the 
> ordering and distribution of data.
> That's why we should be able to rewrite {{Repartition -> Sort -> Scan}} as 
> {{Repartition -> Scan}}.
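
A spark-shell sketch of the pattern this rule targets (toy data): the sort below is immediately destroyed by the repartition, so the optimizer can drop it.

{code:scala}
val df = spark.range(0, 1000).sort("id").repartition(8)
df.explain(true)
// With the extended EliminateSorts rule, the optimized plan should show the
// repartition directly over the range scan, with no Sort node in between.
{code}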






[jira] [Assigned] (SPARK-32276) Remove redundant sorts before repartition nodes

2020-07-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32276:
-

Assignee: Anton Okolnychyi

> Remove redundant sorts before repartition nodes
> ---
>
> Key: SPARK-32276
> URL: https://issues.apache.org/jira/browse/SPARK-32276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>
> I think our {{EliminateSorts}} rule can be extended further to remove sorts 
> before repartition, repartitionByExpression and coalesce nodes. Independently 
> of whether we do a shuffle or not, each repartition operation will change the 
> ordering and distribution of data.
> That's why we should be able to rewrite {{Repartition -> Sort -> Scan}} as 
> {{Repartition -> Scan}}.






[jira] [Commented] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-07-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160786#comment-17160786
 ] 

Chen Zhang commented on SPARK-32317:


Spark uses the requiredSchema to convert the data read from the Parquet file. It 
does not consider that the requiredSchema and the schema stored in the file may 
be different, which can potentially cause data correctness problems.

I think it is necessary to consider the mapping between the requiredSchema and 
the schema stored in the file. I will try to modify the code, and if it goes 
well, I can submit a PR.

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Its failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate Parquet files partitioned on Date on a daily basis, and we 
> sometimes send updates to historical data. What we noticed is that, due to a 
> configuration error, the patch data schema is inconsistent with the earlier files.
> Assume we had files generated with a schema having ID and Amount as fields. 
> Historical data has a schema like ID INT, AMOUNT DECIMAL(15,6), while the 
> files we send as updates have a schema like DECIMAL(15,2). 
>  
> With two different schemas in a Date partition, when we load the data of a 
> Date into Spark, the data loads but the amount gets manipulated.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together
> df3 = spark.read.parquet("output/")
> df3.show()  # we can see the amount getting manipulated here:
> +---+--------+
> | ID|  AMOUNT|
> +---+--------+
> |  1|    1.95|
> |  3|0.019834|
> |  1|19500.00|
> |  2|  198.34|
> +---+--------+
> Options Tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet 
> column cannot be converted in file file*.snappy.parquet. Column: 
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but the 
> fields that are in common need to have the same schema. That might not work 
> here.
>  
> Looking for a workaround here. Or if there is an option I haven't tried, you 
> can point me to that.
>  
> With schema merging I got the error below:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  at 
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
>  at scala.Option.orElse(Option.scala:447) at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141)
>  at 
> 

[jira] [Comment Edited] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-07-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160207#comment-17160207
 ] 

Chen Zhang edited comment on SPARK-32317 at 7/19/20, 6:40 PM:
--

The DECIMAL type in Parquet is stored as INT32, INT64, FIXED_LEN_BYTE_ARRAY, or
BINARY.

Taking 19500.00 stored as INT64 as an example: the Parquet file stores the
unscaled value, so DECIMAL(15,2) -> 1950000 while DECIMAL(15,6) -> 19500000000.

 


was (Author: chen zhang):
The DECIMAL type in Parquet is stored as INT32, INT64, FIXED_LEN_BYTE_ARRAY, or
BINARY.

Taking 19500.00 stored as INT64 as an example: the Parquet file stores the
unscaled value, so DECIMAL(15,2) -> 1950000 while DECIMAL(15,6) -> 19500000000.

Spark uses requiredSchema to convert the read parquet file data. It does not 
consider that the requiredSchema and the schema stored in the file may be 
different. This may potentially cause data correctness problems.

I think it is necessary to consider the mapping between the requiredSchema and 
the schema stored in the file. I will try to modify the code, and if it goes 
well, I can submit a PR.

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Its failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files partitioned on Date on a daily basis, and we
> sometimes send updates to historical data. What we noticed is that, due to a
> configuration error, the patch data schema is inconsistent with the earlier
> files.
> Assume the files were generated with ID and Amount as fields: the historical
> data has the schema ID INT, AMOUNT DECIMAL(15,6), while the files we send as
> updates have AMOUNT DECIMAL(15,2).
>  
> With two different schemas in the same Date partition, loading that Date into
> Spark still succeeds, but the amount values get corrupted.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show() # we can see the amount getting corrupted here
> +---+--------+
> | ID|  AMOUNT|
> +---+--------+
> |  1|    1.95|
> |  3|0.019834|
> |  1|19500.00|
> |  2|  198.34|
> +---+--------+
> Options tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet
> column cannot be converted in file file*.snappy.parquet. Column:
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but
> the fields that are in common need to have the same schema, so that might not
> work here.
>  
> Looking for a workaround here; or if there is an option that I haven't tried,
> you can point me to that.
>  
> With schema merging I got the below error:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  at 
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
>  at scala.Option.orElse(Option.scala:447) at 
> 

[jira] [Comment Edited] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-07-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160207#comment-17160207
 ] 

Chen Zhang edited comment on SPARK-32317 at 7/19/20, 6:38 PM:
--

The DECIMAL type in Parquet is stored as INT32, INT64, FIXED_LEN_BYTE_ARRAY, or
BINARY.

Taking 19500.00 stored as INT64 as an example: the Parquet file stores the
unscaled value, so DECIMAL(15,2) -> 1950000 while DECIMAL(15,6) -> 19500000000.

Spark uses requiredSchema to convert the read parquet file data. It does not 
consider that the requiredSchema and the schema stored in the file may be 
different. This may potentially cause data correctness problems.

I think it is necessary to consider the mapping between the requiredSchema and 
the schema stored in the file. I will try to modify the code, and if it goes 
well, I can submit a PR.


was (Author: chen zhang):
The DECIMAL type in Parquet is stored as INT32, INT64, FIXED_LEN_BYTE_ARRAY, or
BINARY.

Taking 19500.00 stored as INT64 as an example: the Parquet file stores the
unscaled value, so DECIMAL(15,2) -> 1950000 while DECIMAL(15,6) -> 19500000000.

Spark uses requiredSchema to convert the read parquet file data. It does not 
consider that the requiredSchema and the schema stored in the file may be 
different. This may potentially cause data correctness problems.

I think it is necessary to consider the mapping between the requiredSchema and 
the schema stored in the file. I will try to modify the code, and if it goes 
well, I can submit a PR.

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Its failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files partitioned on Date on a daily basis, and we
> sometimes send updates to historical data. What we noticed is that, due to a
> configuration error, the patch data schema is inconsistent with the earlier
> files.
> Assume the files were generated with ID and Amount as fields: the historical
> data has the schema ID INT, AMOUNT DECIMAL(15,6), while the files we send as
> updates have AMOUNT DECIMAL(15,2).
>  
> With two different schemas in the same Date partition, loading that Date into
> Spark still succeeds, but the amount values get corrupted.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show() # we can see the amount getting corrupted here
> +---+--------+
> | ID|  AMOUNT|
> +---+--------+
> |  1|    1.95|
> |  3|0.019834|
> |  1|19500.00|
> |  2|  198.34|
> +---+--------+
> Options tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet
> column cannot be converted in file file*.snappy.parquet. Column:
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but
> the fields that are in common need to have the same schema, so that might not
> work here.
>  
> Looking for a workaround here; or if there is an option that I haven't tried,
> you can point me to that.
>  
> With schema merging I got the below error:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  

[jira] [Comment Edited] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-07-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160207#comment-17160207
 ] 

Chen Zhang edited comment on SPARK-32317 at 7/19/20, 6:10 PM:
--

The DECIMAL type in Parquet is stored as INT32, INT64, FIXED_LEN_BYTE_ARRAY, or
BINARY.

Taking 19500.00 stored as INT64 as an example: the Parquet file stores the
unscaled value, so DECIMAL(15,2) -> 1950000 while DECIMAL(15,6) -> 19500000000.

Spark uses requiredSchema to convert the read parquet file data. It does not 
consider that the requiredSchema and the schema stored in the file may be 
different. This may potentially cause data correctness problems.

I think it is necessary to consider the mapping between the requiredSchema and 
the schema stored in the file. I will try to modify the code, and if it goes 
well, I can submit a PR.


was (Author: chen zhang):
The DECIMAL type in Parquet is stored as INT32, INT64, FIXED_LEN_BYTE_ARRAY, or
BINARY.

Taking 19500.00 stored as INT64 as an example: the Parquet file stores the
unscaled value, so DECIMAL(15,2) -> 1950000 while DECIMAL(15,6) -> 19500000000.

I see that the Spark source code
org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter uses the
catalystType to convert the decimal type. Maybe we should use the schema in the
Parquet file for the conversion instead.

something like:
{code:java}
// current behaviour: the converter takes precision/scale from the catalyst (required) schema
//case t: DecimalType if parquetType.asPrimitiveType().getPrimitiveTypeName == INT64 =>
//  new ParquetIntDictionaryAwareDecimalConverter(t.precision, t.scale, updater)

// proposed: take precision/scale from the decimal metadata stored in the Parquet file itself
case t: DecimalType if parquetType.asPrimitiveType().getPrimitiveTypeName == INT64 =>
  val mate = parquetType.asPrimitiveType().getDecimalMetadata()
  new ParquetLongDictionaryAwareDecimalConverter(mate.getPrecision, mate.getScale, updater)
{code}
I will do some validation later. 

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Its failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files partitioned on Date on a daily basis, and we
> sometimes send updates to historical data. What we noticed is that, due to a
> configuration error, the patch data schema is inconsistent with the earlier
> files.
> Assume the files were generated with ID and Amount as fields: the historical
> data has the schema ID INT, AMOUNT DECIMAL(15,6), while the files we send as
> updates have AMOUNT DECIMAL(15,2).
>  
> With two different schemas in the same Date partition, loading that Date into
> Spark still succeeds, but the amount values get corrupted.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show() # we can see the amount getting corrupted here
> +---+--------+
> | ID|  AMOUNT|
> +---+--------+
> |  1|    1.95|
> |  3|0.019834|
> |  1|19500.00|
> |  2|  198.34|
> +---+--------+
> Options tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet
> column cannot be converted in file file*.snappy.parquet. Column:
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but
> the fields that are in common need to have the same schema, so that might not
> work here.
>  
> Looking for a workaround here; or if there is an option that I haven't tried,
> you can point me to that.
>  
> With schema merging I got the below error:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at 

[jira] [Commented] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-07-19 Thread Krish (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160776#comment-17160776
 ] 

Krish commented on SPARK-32317:
---

Thanks for the update. The problem I see here is illustrated below with examples:
 # No one knowingly supplies a wrong schema; we all know we build systems that
are later maintained by multiple people.
 # We use these systems for big data processing, and a record might go through
multiple updates for a good reason. If one of the people maintaining the system
makes a typo and the configuration gets created with D(15,6) instead of D(15,2)
for the amount, and that config is then used to create an update for older
records, the problem appears: all the previous records written to older date
partitions carry the schema D(15,2), while the new updated file comes with
D(15,6).
 # If we load the older record and the new record into two different data
frames and compare them field by field, we won't see the issue; the data loads
into the two data frames and all checks pass.
 # By default Spark has schema merging disabled, so we just load the data and
assume it is loaded properly and hence correct, but that is not the case.

This is exactly what happened in our case. I feel there should be a better way
of handling this kind of issue!
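
In the meantime, one possible pre-flight check is sketched below; it assumes
PyArrow is available and is not a Spark feature. It reads each file's footer
and flags AMOUNT columns whose declared decimal type differs, before Spark
loads the whole partition ("output/" and "AMOUNT" are the path and column name
from this report):

{code:python}
import glob

import pyarrow.parquet as pq

# Group the files in the partition by the declared type of the AMOUNT column.
schemas = {}
for path in glob.glob("output/*.parquet"):
    amount_type = pq.ParquetFile(path).schema_arrow.field("AMOUNT").type
    schemas.setdefault(str(amount_type), []).append(path)

# More than one distinct type means the partition mixes DECIMAL(15,2) and DECIMAL(15,6).
if len(schemas) > 1:
    print("Inconsistent AMOUNT types across files:", schemas)
{code}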

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Its failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files partitioned on Date on a daily basis, and we
> sometimes send updates to historical data. What we noticed is that, due to a
> configuration error, the patch data schema is inconsistent with the earlier
> files.
> Assume the files were generated with ID and Amount as fields: the historical
> data has the schema ID INT, AMOUNT DECIMAL(15,6), while the files we send as
> updates have AMOUNT DECIMAL(15,2).
>  
> With two different schemas in the same Date partition, loading that Date into
> Spark still succeeds, but the amount values get corrupted.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show() # we can see the amount getting corrupted here
> +---+--------+
> | ID|  AMOUNT|
> +---+--------+
> |  1|    1.95|
> |  3|0.019834|
> |  1|19500.00|
> |  2|  198.34|
> +---+--------+
> Options tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet
> column cannot be converted in file file*.snappy.parquet. Column:
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but
> the fields that are in common need to have the same schema, so that might not
> work here.
>  
> Looking for a workaround here; or if there is an option that I haven't tried,
> you can point me to that.
>  
> With schema merging I got the below error:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  

[jira] [Updated] (SPARK-32358) temp view not working after upgrading from 2.3.3 to 2.4.5

2020-07-19 Thread philipse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

philipse updated SPARK-32358:
-
Description: 
After upgrading from Spark 2.3.3 to 2.4.5, the temp view does not seem to work.
Please correct me if I am missing something. Thanks!

Steps to reproduce:

```

from pyspark.sql import SparkSession
 from pyspark.sql import Row
 spark=SparkSession\
 .builder \
 .appName('scenary_address_1') \
 .enableHiveSupport() \
 .getOrCreate()
 
address_tok_result_df=spark.createDataFrame([Row(a=1,b='难',c=80),Row(a=2,b='v',c=81)])
 print("create dataframe finished")
 address_tok_result_df.createOrReplaceTempView("scenery_address_test1")
 print(spark.read.table('scenery_address_test1').dtypes)
 spark.sql("select * from scenery_address_test1").show()

```

 

In Spark 2.3.3 I can easily get the following result:

```

create dataframe finished
 [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1| 难| 80|
|  2|  v| 81|
+---+---+---+

```

 

But in 2.4.5 I only get the following, without the result rows being shown:

create dataframe finished
 [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]

  was:
After upgrading from Spark 2.3.3 to 2.4.5, the temp view does not seem to work.
Please correct me if I am missing something. Thanks!

Steps to reproduce:

```

from pyspark.sql import SparkSession
 from pyspark.sql import Row
 spark=SparkSession\
 .builder \
 .appName('scenary_address_1') \
 .enableHiveSupport() \
 .getOrCreate()
 
address_tok_result_df=spark.createDataFrame([Row(a=1,b='难',c=80),Row(a=2,b='v',c=81)])
 print("create dataframe finished")
 address_tok_result_df.createOrReplaceTempView("scenery_address_test1")
 print(spark.read.table('scenery_address_test1').dtypes)
 spark.sql("select * from scenery_address_test1").show()

```

 

In Spark 2.3.3 I can easily get the following result:

```

create dataframe finished
 [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1| 难| 80|
|  2|  v| 81|
+---+---+---+

```

 

But in 2.4.5 I only get:

create dataframe finished
 [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]


> temp view not working after upgrading from 2.3.3 to 2.4.5
> -
>
> Key: SPARK-32358
> URL: https://issues.apache.org/jira/browse/SPARK-32358
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: philipse
>Priority: Major
>
> After upgrading from Spark 2.3.3 to 2.4.5, the temp view does not seem to work.
> Please correct me if I am missing something. Thanks!
> Steps to reproduce:
> ```
> from pyspark.sql import SparkSession
>  from pyspark.sql import Row
>  spark=SparkSession\
>  .builder \
>  .appName('scenary_address_1') \
>  .enableHiveSupport() \
>  .getOrCreate()
>  
> address_tok_result_df=spark.createDataFrame([Row(a=1,b='难',c=80),Row(a=2,b='v',c=81)])
>  print("create dataframe finished")
>  address_tok_result_df.createOrReplaceTempView("scenery_address_test1")
>  print(spark.read.table('scenery_address_test1').dtypes)
>  spark.sql("select * from scenery_address_test1").show()
> ```
>  
> In Spark 2.3.3 I can easily get the following result:
> ```
> create dataframe finished
>  [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1| 难| 80|
> |  2|  v| 81|
> +---+---+---+
> ```
>  
> But in 2.4.5 I only get the following, without the result rows being shown:
> create dataframe finished
>  [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32358) temp view not working after upgrading from 2.3.3 to 2.4.5

2020-07-19 Thread philipse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

philipse updated SPARK-32358:
-
Description: 
After upgrading from Spark 2.3.3 to 2.4.5, the temp view does not seem to work.
Please correct me if I am missing something. Thanks!

Steps to reproduce:

```

from pyspark.sql import SparkSession
 from pyspark.sql import Row
 spark=SparkSession\
 .builder \
 .appName('scenary_address_1') \
 .enableHiveSupport() \
 .getOrCreate()
 
address_tok_result_df=spark.createDataFrame([Row(a=1,b='难',c=80),Row(a=2,b='v',c=81)])
 print("create dataframe finished")
 address_tok_result_df.createOrReplaceTempView("scenery_address_test1")
 print(spark.read.table('scenery_address_test1').dtypes)
 spark.sql("select * from scenery_address_test1").show()

```

 

In Spark 2.3.3 I can easily get the following result:

```

create dataframe finished
 [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1| 难| 80|
|  2|  v| 81|
+---+---+---+

```

 

But in 2.4.5 I only get:

create dataframe finished
 [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]

  was:
After upgrading from Spark 2.3.3 to 2.4.5, the temp view does not seem to work.
I am not sure if I am missing something.

Steps to reproduce:

```

from pyspark.sql import SparkSession
from pyspark.sql import Row
spark=SparkSession\
.builder \
.appName('scenary_address_1') \
.enableHiveSupport() \
.getOrCreate()
address_tok_result_df=spark.createDataFrame([Row(a=1,b='难',c=80),Row(a=2,b='v',c=81)])
print("create dataframe finished")
address_tok_result_df.createOrReplaceTempView("scenery_address_test1")
print(spark.read.table('scenery_address_test1').dtypes)
spark.sql("select * from scenery_address_test1").show()

```

 

In Spark 2.3.3 I can easily get the following result:

```

create dataframe finished
[('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 难| 80|
| 2| v| 81|
+---+---+---+

```

 

But in 2.4.5 I only get:

create dataframe finished
[('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]


> temp view not working after upgrading from 2.3.3 to 2.4.5
> -
>
> Key: SPARK-32358
> URL: https://issues.apache.org/jira/browse/SPARK-32358
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: philipse
>Priority: Major
>
> After upgrading from Spark 2.3.3 to 2.4.5, the temp view does not seem to work.
> Please correct me if I am missing something. Thanks!
> Steps to reproduce:
> ```
> from pyspark.sql import SparkSession
>  from pyspark.sql import Row
>  spark=SparkSession\
>  .builder \
>  .appName('scenary_address_1') \
>  .enableHiveSupport() \
>  .getOrCreate()
>  
> address_tok_result_df=spark.createDataFrame([Row(a=1,b='难',c=80),Row(a=2,b='v',c=81)])
>  print("create dataframe finished")
>  address_tok_result_df.createOrReplaceTempView("scenery_address_test1")
>  print(spark.read.table('scenery_address_test1').dtypes)
>  spark.sql("select * from scenery_address_test1").show()
> ```
>  
> In Spark 2.3.3 I can easily get the following result:
> ```
> create dataframe finished
>  [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1| 难| 80|
> |  2|  v| 81|
> +---+---+---+
> ```
>  
> But in 2.4.5 I only get:
> create dataframe finished
>  [('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32358) temp view not working after upgrading from 2.3.3 to 2.4.5

2020-07-19 Thread philipse (Jira)
philipse created SPARK-32358:


 Summary: temp view not working after upgrading from 2.3.3 to 2.4.5
 Key: SPARK-32358
 URL: https://issues.apache.org/jira/browse/SPARK-32358
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.5
Reporter: philipse


After upgrading from Spark 2.3.3 to 2.4.5, the temp view does not seem to work.
I am not sure if I am missing something.

Steps to reproduce:

```

from pyspark.sql import SparkSession
from pyspark.sql import Row
spark=SparkSession\
.builder \
.appName('scenary_address_1') \
.enableHiveSupport() \
.getOrCreate()
address_tok_result_df=spark.createDataFrame([Row(a=1,b='难',c=80),Row(a=2,b='v',c=81)])
print("create dataframe finished")
address_tok_result_df.createOrReplaceTempView("scenery_address_test1")
print(spark.read.table('scenery_address_test1').dtypes)
spark.sql("select * from scenery_address_test1").show()

```

 

In Spark 2.3.3 I can easily get the following result:

```

create dataframe finished
[('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 难| 80|
| 2| v| 81|
+---+---+---+

```

 

But in 2.4.5 I only get:

create dataframe finished
[('a', 'bigint'), ('b', 'string'), ('c', 'bigint')]
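
A small diagnostic that may help narrow this down (only a sketch, run in the
same session and with the same view name as above) is to check whether the
DataFrame itself or the temp view lookup is what stops returning rows:

{code:python}
# Does the DataFrame itself still return rows?
address_tok_result_df.show()

# Is the view registered, and does reading it through the catalog work?
print(spark.catalog.listTables())
spark.table("scenery_address_test1").show()
{code}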



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160627#comment-17160627
 ] 

JinxinTang commented on SPARK-32347:


cc [~ibobak] Thanks for your issue, I have just raised a PR for this.
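
Until the fix lands, a possible workaround (an assumption on my side, not taken
from the PR) is to express the same broadcast join through the DataFrame API,
which does not go through the SQL hint parsing that trips up 3.0.0 here:

{code:python}
from pyspark.sql.functions import broadcast

# df_sales and df_buyers are the DataFrames created in the reproduction above.
result = (
    df_sales
    .join(broadcast(df_buyers), on="BuyerID", how="inner")
    .select("BuyerID", "BuyerName", "Qty")
)
result.toPandas()
{code}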

> BROADCAST hint makes a weird message that "column can't be resolved" (it was 
> OK in Spark 2.4)
> -
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, jupyter notebook, spark launched in 
> local[4] mode, but with Standalone cluster it also fails the same way.
>  
>  
>Reporter: Ihor Bobak
>Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>
> The bug is very easily reproduced: run the same code below in Spark 2.4.3 and
> in 3.0.0. The SQL parser raises an invalid error message in 3.0.0, although
> the SQL statement looks fine and works correctly in Spark 2.4.3.
> {code:python}
> import pandas as pd
> pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
> pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", 
> "BuyerName"])
> df_sales = spark.createDataFrame(pdf_sales)
> df_buyers = spark.createDataFrame(pdf_buyers)
> df_sales.createOrReplaceTempView("df_sales")
> df_buyers.createOrReplaceTempView("df_buyers")
> spark.sql("""
> with b as (
> select /*+ BROADCAST(df_buyers) */
> BuyerID, BuyerName 
> from df_buyers
> )
> select 
> b.BuyerID,
> b.BuyerName,
> s.Qty
> from df_sales s
> inner join b on s.BuyerID =  b.BuyerID
> """).toPandas()
> {code}
> The (wrong) error message:
> ---
> AnalysisException Traceback (most recent call last)
>  in 
>  22 from df_sales s
>  23 inner join b on s.BuyerID =  b.BuyerID
> ---> 24 """).toPandas()
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in 
> sql(self, sqlQuery)
> 644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, 
> f2=u'row3')]
> 645 """
> --> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), 
> self._wrapped)
> 647 
> 648 @since(2.0)
> /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1303 answer = self.gateway_client.send_command(command)
>1304 return_value = get_return_value(
> -> 1305 answer, self.gateway_client, self.target_id, self.name)
>1306 
>1307 for temp_arg in temp_args:
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, 
> **kw)
> 135 # Hide where the exception came from that shows a 
> non-Pythonic
> 136 # JVM exception message.
> --> 137 raise_from(converted)
> 138 else:
> 139 raise
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in 
> raise_from(e)
> AnalysisException: cannot resolve '`s.BuyerID`' given input columns: 
> [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
> 'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
> +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
>:- SubqueryAlias s
>:  +- SubqueryAlias df_sales
>: +- LogicalRDD [BuyerID#23L, Qty#24L], false
>+- SubqueryAlias b
>   +- Project [BuyerID#27L, BuyerName#28]
>  +- SubqueryAlias df_buyers
> +- LogicalRDD [BuyerID#27L, BuyerName#28], false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160620#comment-17160620
 ] 

Apache Spark commented on SPARK-32347:
--

User 'TJX2014' has created a pull request for this issue:
https://github.com/apache/spark/pull/29156

> BROADCAST hint makes a weird message that "column can't be resolved" (it was 
> OK in Spark 2.4)
> -
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, jupyter notebook, spark launched in 
> local[4] mode, but with Standalone cluster it also fails the same way.
>  
>  
>Reporter: Ihor Bobak
>Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>
> The bug is very easily reproduced: run the same code below in Spark 2.4.3 and
> in 3.0.0. The SQL parser raises an invalid error message in 3.0.0, although
> the SQL statement looks fine and works correctly in Spark 2.4.3.
> {code:python}
> import pandas as pd
> pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
> pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", 
> "BuyerName"])
> df_sales = spark.createDataFrame(pdf_sales)
> df_buyers = spark.createDataFrame(pdf_buyers)
> df_sales.createOrReplaceTempView("df_sales")
> df_buyers.createOrReplaceTempView("df_buyers")
> spark.sql("""
> with b as (
> select /*+ BROADCAST(df_buyers) */
> BuyerID, BuyerName 
> from df_buyers
> )
> select 
> b.BuyerID,
> b.BuyerName,
> s.Qty
> from df_sales s
> inner join b on s.BuyerID =  b.BuyerID
> """).toPandas()
> {code}
> The (wrong) error message:
> ---
> AnalysisException Traceback (most recent call last)
>  in 
>  22 from df_sales s
>  23 inner join b on s.BuyerID =  b.BuyerID
> ---> 24 """).toPandas()
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in 
> sql(self, sqlQuery)
> 644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, 
> f2=u'row3')]
> 645 """
> --> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), 
> self._wrapped)
> 647 
> 648 @since(2.0)
> /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1303 answer = self.gateway_client.send_command(command)
>1304 return_value = get_return_value(
> -> 1305 answer, self.gateway_client, self.target_id, self.name)
>1306 
>1307 for temp_arg in temp_args:
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, 
> **kw)
> 135 # Hide where the exception came from that shows a 
> non-Pythonic
> 136 # JVM exception message.
> --> 137 raise_from(converted)
> 138 else:
> 139 raise
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in 
> raise_from(e)
> AnalysisException: cannot resolve '`s.BuyerID`' given input columns: 
> [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
> 'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
> +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
>:- SubqueryAlias s
>:  +- SubqueryAlias df_sales
>: +- LogicalRDD [BuyerID#23L, Qty#24L], false
>+- SubqueryAlias b
>   +- Project [BuyerID#27L, BuyerName#28]
>  +- SubqueryAlias df_buyers
> +- LogicalRDD [BuyerID#27L, BuyerName#28], false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32347:


Assignee: Apache Spark

> BROADCAST hint makes a weird message that "column can't be resolved" (it was 
> OK in Spark 2.4)
> -
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, jupyter notebook, spark launched in 
> local[4] mode, but with Standalone cluster it also fails the same way.
>  
>  
>Reporter: Ihor Bobak
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>
> The bug is very easily reproduced: run the same code below in Spark 2.4.3 and
> in 3.0.0. The SQL parser raises an invalid error message in 3.0.0, although
> the SQL statement looks fine and works correctly in Spark 2.4.3.
> {code:python}
> import pandas as pd
> pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
> pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", 
> "BuyerName"])
> df_sales = spark.createDataFrame(pdf_sales)
> df_buyers = spark.createDataFrame(pdf_buyers)
> df_sales.createOrReplaceTempView("df_sales")
> df_buyers.createOrReplaceTempView("df_buyers")
> spark.sql("""
> with b as (
> select /*+ BROADCAST(df_buyers) */
> BuyerID, BuyerName 
> from df_buyers
> )
> select 
> b.BuyerID,
> b.BuyerName,
> s.Qty
> from df_sales s
> inner join b on s.BuyerID =  b.BuyerID
> """).toPandas()
> {code}
> The (wrong) error message:
> ---
> AnalysisException Traceback (most recent call last)
>  in 
>  22 from df_sales s
>  23 inner join b on s.BuyerID =  b.BuyerID
> ---> 24 """).toPandas()
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in 
> sql(self, sqlQuery)
> 644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, 
> f2=u'row3')]
> 645 """
> --> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), 
> self._wrapped)
> 647 
> 648 @since(2.0)
> /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1303 answer = self.gateway_client.send_command(command)
>1304 return_value = get_return_value(
> -> 1305 answer, self.gateway_client, self.target_id, self.name)
>1306 
>1307 for temp_arg in temp_args:
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, 
> **kw)
> 135 # Hide where the exception came from that shows a 
> non-Pythonic
> 136 # JVM exception message.
> --> 137 raise_from(converted)
> 138 else:
> 139 raise
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in 
> raise_from(e)
> AnalysisException: cannot resolve '`s.BuyerID`' given input columns: 
> [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
> 'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
> +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
>:- SubqueryAlias s
>:  +- SubqueryAlias df_sales
>: +- LogicalRDD [BuyerID#23L, Qty#24L], false
>+- SubqueryAlias b
>   +- Project [BuyerID#27L, BuyerName#28]
>  +- SubqueryAlias df_buyers
> +- LogicalRDD [BuyerID#27L, BuyerName#28], false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32347:


Assignee: (was: Apache Spark)

> BROADCAST hint makes a weird message that "column can't be resolved" (it was 
> OK in Spark 2.4)
> -
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, jupyter notebook, spark launched in 
> local[4] mode, but with Standalone cluster it also fails the same way.
>  
>  
>Reporter: Ihor Bobak
>Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>
> The bug is very easily reproduced: run the same code below in Spark 2.4.3 and
> in 3.0.0. The SQL parser raises an invalid error message in 3.0.0, although
> the SQL statement looks fine and works correctly in Spark 2.4.3.
> {code:python}
> import pandas as pd
> pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
> pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", 
> "BuyerName"])
> df_sales = spark.createDataFrame(pdf_sales)
> df_buyers = spark.createDataFrame(pdf_buyers)
> df_sales.createOrReplaceTempView("df_sales")
> df_buyers.createOrReplaceTempView("df_buyers")
> spark.sql("""
> with b as (
> select /*+ BROADCAST(df_buyers) */
> BuyerID, BuyerName 
> from df_buyers
> )
> select 
> b.BuyerID,
> b.BuyerName,
> s.Qty
> from df_sales s
> inner join b on s.BuyerID =  b.BuyerID
> """).toPandas()
> {code}
> The (wrong) error message:
> ---
> AnalysisException Traceback (most recent call last)
>  in 
>  22 from df_sales s
>  23 inner join b on s.BuyerID =  b.BuyerID
> ---> 24 """).toPandas()
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in 
> sql(self, sqlQuery)
> 644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, 
> f2=u'row3')]
> 645 """
> --> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), 
> self._wrapped)
> 647 
> 648 @since(2.0)
> /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1303 answer = self.gateway_client.send_command(command)
>1304 return_value = get_return_value(
> -> 1305 answer, self.gateway_client, self.target_id, self.name)
>1306 
>1307 for temp_arg in temp_args:
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, 
> **kw)
> 135 # Hide where the exception came from that shows a 
> non-Pythonic
> 136 # JVM exception message.
> --> 137 raise_from(converted)
> 138 else:
> 139 raise
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in 
> raise_from(e)
> AnalysisException: cannot resolve '`s.BuyerID`' given input columns: 
> [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
> 'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
> +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
>:- SubqueryAlias s
>:  +- SubqueryAlias df_sales
>: +- LogicalRDD [BuyerID#23L, Qty#24L], false
>+- SubqueryAlias b
>   +- Project [BuyerID#27L, BuyerName#28]
>  +- SubqueryAlias df_buyers
> +- LogicalRDD [BuyerID#27L, BuyerName#28], false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32347) BROADCAST hint makes a weird message that "column can't be resolved" (it was OK in Spark 2.4)

2020-07-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160619#comment-17160619
 ] 

Apache Spark commented on SPARK-32347:
--

User 'TJX2014' has created a pull request for this issue:
https://github.com/apache/spark/pull/29156

> BROADCAST hint makes a weird message that "column can't be resolved" (it was 
> OK in Spark 2.4)
> -
>
> Key: SPARK-32347
> URL: https://issues.apache.org/jira/browse/SPARK-32347
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0, jupyter notebook, spark launched in 
> local[4] mode, but with Standalone cluster it also fails the same way.
>  
>  
>Reporter: Ihor Bobak
>Priority: Major
> Fix For: 3.0.1
>
> Attachments: 2020-07-17 17_46_32-Window.png, 2020-07-17 
> 17_49_27-Window.png, 2020-07-17 17_52_51-Window.png
>
>
> The bug is very easily reproduced: run the same code below in Spark 2.4.3 and
> in 3.0.0. The SQL parser raises an invalid error message in 3.0.0, although
> the SQL statement looks fine and works correctly in Spark 2.4.3.
> {code:python}
> import pandas as pd
> pdf_sales = pd.DataFrame([(1, 10), (2, 20)], columns=["BuyerID", "Qty"])
> pdf_buyers = pd.DataFrame([(1, "John"), (2, "Jack")], columns=["BuyerID", 
> "BuyerName"])
> df_sales = spark.createDataFrame(pdf_sales)
> df_buyers = spark.createDataFrame(pdf_buyers)
> df_sales.createOrReplaceTempView("df_sales")
> df_buyers.createOrReplaceTempView("df_buyers")
> spark.sql("""
> with b as (
> select /*+ BROADCAST(df_buyers) */
> BuyerID, BuyerName 
> from df_buyers
> )
> select 
> b.BuyerID,
> b.BuyerName,
> s.Qty
> from df_sales s
> inner join b on s.BuyerID =  b.BuyerID
> """).toPandas()
> {code}
> The (wrong) error message:
> ---
> AnalysisException Traceback (most recent call last)
>  in 
>  22 from df_sales s
>  23 inner join b on s.BuyerID =  b.BuyerID
> ---> 24 """).toPandas()
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/session.py in 
> sql(self, sqlQuery)
> 644 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, 
> f2=u'row3')]
> 645 """
> --> 646 return DataFrame(self._jsparkSession.sql(sqlQuery), 
> self._wrapped)
> 647 
> 648 @since(2.0)
> /opt/spark-3.0.0-bin-without-hadoop/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1303 answer = self.gateway_client.send_command(command)
>1304 return_value = get_return_value(
> -> 1305 answer, self.gateway_client, self.target_id, self.name)
>1306 
>1307 for temp_arg in temp_args:
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in deco(*a, 
> **kw)
> 135 # Hide where the exception came from that shows a 
> non-Pythonic
> 136 # JVM exception message.
> --> 137 raise_from(converted)
> 138 else:
> 139 raise
> /opt/spark-3.0.0-bin-without-hadoop/python/pyspark/sql/utils.py in 
> raise_from(e)
> AnalysisException: cannot resolve '`s.BuyerID`' given input columns: 
> [s.BuyerID, b.BuyerID, b.BuyerName, s.Qty]; line 12 pos 24;
> 'Project ['b.BuyerID, 'b.BuyerName, 's.Qty]
> +- 'Join Inner, ('s.BuyerID = 'b.BuyerID)
>:- SubqueryAlias s
>:  +- SubqueryAlias df_sales
>: +- LogicalRDD [BuyerID#23L, Qty#24L], false
>+- SubqueryAlias b
>   +- Project [BuyerID#27L, BuyerName#28]
>  +- SubqueryAlias df_buyers
> +- LogicalRDD [BuyerID#27L, BuyerName#28], false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org