[jira] [Created] (SPARK-31344) Polish implementation of barrier() and allGather()

2020-04-03 Thread wuyi (Jira)
wuyi created SPARK-31344:


 Summary: Polish implementation of barrier() and allGather()
 Key: SPARK-31344
 URL: https://issues.apache.org/jira/browse/SPARK-31344
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 3.0.0
Reporter: wuyi


Currently, the implementations of barrier() and allGather() contain a lot of 
duplicated code; we should polish them to make the code simpler. 
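As an illustration of the kind of cleanup intended, a rough sketch is below; the 
class, helper, and request names are illustrative only and are not Spark's actual 
internals. The idea is that barrier() and allGather() share one synchronization 
helper and differ only in the payload they send and return.

{code:python}
# Hedged sketch (not the actual Spark implementation): barrier() and
# allGather() delegate to a single helper so the request/response plumbing
# is written only once. The coordinator call is stubbed out here.
class BarrierContextSketch(object):
    def __init__(self, num_tasks):
        self.num_tasks = num_tasks

    def _run_barrier(self, request, message=""):
        # In real code this would block until every task in the stage has
        # issued the same request, then return the collected messages.
        print("sync point reached: %s" % request)
        return [message] * self.num_tasks

    def barrier(self):
        # Plain barrier: synchronize only, discard the messages.
        self._run_barrier("BARRIER")

    def allGather(self, message=""):
        # Same synchronization, but every task's message is returned.
        return self._run_barrier("ALL_GATHER", message)


if __name__ == "__main__":
    ctx = BarrierContextSketch(num_tasks=4)
    ctx.barrier()
    print(ctx.allGather("host-0"))
{code}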






[jira] [Commented] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging

2020-04-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074961#comment-17074961
 ] 

Hyukjin Kwon commented on SPARK-31231:
--

This was fixed in setuptools https://github.com/pypa/setuptools/pull/2046

> Support setuptools 46.1.0+ in PySpark packaging
> ---
>
> Key: SPARK-31231
> URL: https://issues.apache.org/jira/browse/SPARK-31231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> PIP packaging test started to fail (see 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)
> as of the setuptools 46.1.0 release.
> In https://github.com/pypa/setuptools/issues/1424, they decided not to keep 
> the modes in {{package_data}}. In PySpark pip installation, we keep the 
> executable scripts in {{package_data}} 
> https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and 
> expose their symbolic links as executable scripts.
> So, the symbolic links (or copied scripts) execute the scripts copied from 
> {{package_data}}, which no longer have the executable modes set:
> {code}
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> Permission denied
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> cannot execute: Permission denied
> {code}
> The current issue is being tracked at 
> https://github.com/pypa/setuptools/issues/2041
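For illustration only, a minimal sketch of the kind of packaging workaround this 
implies (this is not PySpark's actual setup.py; the package name, paths, and the 
custom install command are assumptions): since newer setuptools no longer 
preserves file modes for {{package_data}}, a custom install step can restore the 
executable bit on the copied scripts explicitly.

{code:python}
# Hedged sketch: restore the executable bit on scripts shipped via
# package_data, since setuptools 46.1.0+ copies them without preserving
# modes. Package name and script locations below are illustrative.
import os
import stat

from setuptools import setup
from setuptools.command.install import install


class InstallAndChmod(install):
    def run(self):
        install.run(self)
        script_dir = os.path.join(self.install_lib, "mypackage", "bin")
        if not os.path.isdir(script_dir):
            return
        for name in os.listdir(script_dir):
            path = os.path.join(script_dir, name)
            mode = os.stat(path).st_mode
            # Re-add the execute bits that package_data copying dropped.
            os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


setup(
    name="mypackage",
    version="0.1",
    packages=["mypackage"],
    package_data={"mypackage": ["bin/*"]},
    cmdclass={"install": InstallAndChmod},
)
{code}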






[jira] [Resolved] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging

2020-04-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31231.
--
  Assignee: Hyukjin Kwon
Resolution: Fixed

> Support setuptools 46.1.0+ in PySpark packaging
> ---
>
> Key: SPARK-31231
> URL: https://issues.apache.org/jira/browse/SPARK-31231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> PIP packaging test started to fail (see 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)
> as of the setuptools 46.1.0 release.
> In https://github.com/pypa/setuptools/issues/1424, they decided not to keep 
> the modes in {{package_data}}. In PySpark pip installation, we keep the 
> executable scripts in {{package_data}} 
> https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and 
> expose their symbolic links as executable scripts.
> So, the symbolic links (or copied scripts) execute the scripts copied from 
> {{package_data}}, which no longer have the executable modes set:
> {code}
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> Permission denied
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> cannot execute: Permission denied
> {code}
> The current issue is being tracked at 
> https://github.com/pypa/setuptools/issues/2041






[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications

2020-04-03 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074889#comment-17074889
 ] 

L. C. Hsieh commented on SPARK-31308:
-

Thank you [~dongjoon]

> Make Python dependencies available for Non-PySpark applications
> ---
>
> Key: SPARK-31308
> URL: https://issues.apache.org/jira/browse/SPARK-31308
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Submit
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Created] (SPARK-31343) Check codegen does not fail on expressions with special characters in string parameters

2020-04-03 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31343:
--

 Summary: Check codegen does not fail on expressions with special 
characters in string parameters
 Key: SPARK-31343
 URL: https://issues.apache.org/jira/browse/SPARK-31343
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Add tests, similar to those added by the PR 
https://github.com/apache/spark/pull/20182, for from_utc_timestamp / 
to_utc_timestamp.






[jira] [Created] (SPARK-31342) Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582

2020-04-03 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-31342:
-

 Summary: Fail by default if Parquet DATE or TIMESTAMP data is 
before October 15, 1582
 Key: SPARK-31342
 URL: https://issues.apache.org/jira/browse/SPARK-31342
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Bruce Robbins


Some users may not know they are creating and/or reading DATE or TIMESTAMP data 
from before October 15, 1582 (because of data encoding libraries, etc.). 
Therefore, it may not be clear whether they need to toggle the two 
rebaseDateTime config settings.

By default, Spark should fail if it reads or writes data from October 15, 1582 
or before.
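To make the proposal concrete, here is a hedged PySpark sketch of the situation; 
the exact error raised and the name of any opt-out/rebase configuration are left 
open, since they are not fixed by this ticket.

{code:python}
# Hedged sketch: writing and reading a DATE before October 15, 1582 through
# Parquet. Under the proposal, Spark would fail by default on such values
# instead of silently producing calendar-ambiguous data, and the user would
# opt in explicitly via a legacy/rebase setting (name not shown here on
# purpose, since it is not decided by this ticket).
import datetime

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame([Row(d=datetime.date(1200, 1, 1))])
# With the proposed default, this write (or the later read) would raise an
# error because the date predates the Gregorian cutover.
df.write.mode("overwrite").parquet("/tmp/pre_gregorian_dates")
spark.read.parquet("/tmp/pre_gregorian_dates").show()
{code}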






[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-04-03 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074745#comment-17074745
 ] 

Bruce Robbins commented on SPARK-30951:
---

{quote}
we can fail by default when reading datetime values before 1582 from parquet 
files.
{quote}
That sounds reasonable. I can make a PR, but if someone beats me to it, I won't 
complain.

> Potential data loss for legacy applications after switch to proleptic 
> Gregorian calendar
> 
>
> Key: SPARK-30951
> URL: https://issues.apache.org/jira/browse/SPARK-30951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Assignee: Maxim Gekk
>Priority: Blocker
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
> containing dates before October 15, 1582. This could be an issue when such 
> sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps 
> before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on 
> dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode 
> sensitive dates. On insert, the library encodes the actual date as some other 
> date. On select, the library decodes the date back to the original date. The 
> encoded value could be any date, including one before October 15, 1582 (e.g., 
> "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as 
> a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was 
> upgraded to use the proleptic Gregorian calendar. Spark applications that 
> read files created by the upgraded component were interpreting encoded or 
> marker dates incorrectly, and vice versa. Also, their data now had a mix of 
> calendars (hybrid and proleptic Gregorian) with no metadata to indicate which 
> file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates 
> using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into 
> trouble when run on Spark 3. The application may not properly interpret the 
> dates previously written by Spark 2. Also, once the Spark 3 version of the 
> application writes data, the tables will have a mix of calendars (hybrid and 
> proleptic gregorian) with no metadata to indicate which file uses which 
> calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data 
> written by one version that cannot be interpreted by the other. And as above, 
> the tables will now have a mix of calendars with no way to detect which file 
> uses which calendar.
> As with the two real-life example cases, these applications may have enormous 
> amounts of legacy data, so re-encoding the dates using some other scheme may 
> not be feasible.
> We might want to consider a configuration setting to allow the user to 
> specify the calendar for storing and retrieving date and timestamp values 
> (not sure how such a flag would affect other date and timestamp-related 
> functions). I realize the change is far bigger than just adding a 
> configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case 
> of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
> proleptic Gregorian calendar between 2.x and 3.x. After some upgrade 
> headaches related to dates before 1582, the Hive community made the following 
> changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> checks a configuration setting to determine which calendar to use.
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> stores the calendar type in the metadata.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files, 
> Hive checks the metadata for the calendar type.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files that 
> lack calendar metadata, Hive's behavior is determined by a configuration 
> setting. This allows Hive to read legacy data 

[jira] [Created] (SPARK-31341) Spark documentation incorrectly claims 3.8 compatibility

2020-04-03 Thread Daniel King (Jira)
Daniel King created SPARK-31341:
---

 Summary: Spark documentation incorrectly claims 3.8 compatibility
 Key: SPARK-31341
 URL: https://issues.apache.org/jira/browse/SPARK-31341
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.5
Reporter: Daniel King


The Spark documentation ([https://spark.apache.org/docs/latest/]) has this text:
{quote}Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, 
Spark 2.4.5 uses Scala 2.12. You will need to use a compatible Scala version 
(2.12.x).
{quote}
This suggests that Spark is compatible with Python 3.8. This is not true. For 
example, in the latest ubuntu:18.04 Docker image:

 
{code:python}
apt-get update
apt-get install python3.8 python3-pip
pip3 install pyspark
python3.8 -m pip install pyspark
python3.8 -c 'import pyspark'
{code}
Outputs:

{code:python}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/pyspark/__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 31, in <module>
    from pyspark import accumulators
  File "/usr/local/lib/python3.8/dist-packages/pyspark/accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "/usr/local/lib/python3.8/dist-packages/pyspark/serializers.py", line 72, in <module>
    from pyspark import cloudpickle
  File "/usr/local/lib/python3.8/dist-packages/pyspark/cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/usr/local/lib/python3.8/dist-packages/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)
{code}

I propose the documentation be updated to say "Python 3.4 to 3.7". I also 
propose that the `setup.py` file for pyspark include:
{code:python}
python_requires=">=3.6,<3.8",
{code}
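For illustration, a minimal sketch of where such a constraint would sit (a 
simplified stand-in, not PySpark's actual setup.py; the metadata below is a 
placeholder):

{code:python}
# Hedged sketch: a stripped-down setup.py showing where the proposed
# python_requires constraint would go. All metadata here is illustrative.
from setuptools import setup

setup(
    name="pyspark",
    version="2.4.5",
    packages=["pyspark"],
    # Reject unsupported interpreters (e.g. Python 3.8) at pip-install time
    # instead of failing later at import time.
    python_requires=">=3.6,<3.8",
)
{code}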






[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2020-04-03 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074689#comment-17074689
 ] 

Dongjoon Hyun commented on SPARK-25102:
---

Are we going to have 2.4.7 or 2.4.8? For now, 2.4.6 is the last planned 
release. Could you send an email to the dev mailing list about your LTS plan 
for 2.4.x first?
cc [~dbtsai]

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata with 
> `org.apache.spark.sql.create.version`. It's different from Hive Table 
> property key `spark.sql.create.version`. It seems that we cannot change that 
> for backward compatibility (even in Apache Spark 3.0)
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}
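For reference, one way to inspect the written key from Python is sketched below; 
it assumes pyarrow is installed and the file was written by a Spark build that 
records the key, and the path is a placeholder.

{code:python}
# Hedged example: read the file-level key/value metadata of a Parquet file
# and look for the Spark-written version entry.
import pyarrow.parquet as pq

meta = pq.ParquetFile("/tmp/p/part-00000.snappy.parquet").metadata
kv = meta.metadata or {}  # bytes -> bytes key/value pairs from the footer

for key, value in kv.items():
    print(key.decode("utf-8", "replace"), "=", value[:80])

print("Spark version recorded in footer:",
      kv.get(b"org.apache.spark.sql.create.version"))
{code}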






[jira] [Reopened] (SPARK-31276) Contrived working example that works with multiple URI file storages for Spark cluster mode

2020-04-03 Thread Jim Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Huang reopened SPARK-31276:
---

Clarified the question with a more succinct scenario that describes the 
challenge of a URI not being able to differentiate between the driver and the 
executors.

> Contrived working example that works with multiple URI file storages for 
> Spark cluster mode
> ---
>
> Key: SPARK-31276
> URL: https://issues.apache.org/jira/browse/SPARK-31276
> Project: Spark
>  Issue Type: Wish
>  Components: Examples
>Affects Versions: 2.4.5
>Reporter: Jim Huang
>Priority: Major
>
> This Spark SQL Guide --> Data sources --> Generic Load/Save Functions
> [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]
> described a very simple "local file system load of an example file".  
>  
> I am looking for an example that demonstrates a workflow that exercises 
> different file systems.  For example, 
>  # Driver loads an input file from local file system
>  # Add a simple column using lit() and stores that DataFrame in cluster mode 
> to HDFS
>  # Write a small, limited subset of that DataFrame back to the driver's local 
> file system.  (This is to avoid the anti-pattern of writing a large file, 
> which is out of scope for this example.  The small DataFrame would contain 
> some basic statistics, not the complete dataset.)
>  
> The examples I found on the internet only use simple paths without explicit 
> URI prefixes.
> Without explicit URI prefixes, a "filepath" inherits how Spark was invoked, 
> local standalone vs. YARN client mode.  So a "filepath" will be read/written 
> on the local file system vs. on HDFS in cluster mode, without these explicit 
> URIs.
> There are situations where a Spark program needs to deal with both the local 
> file system and YARN client mode (big data) in the same Spark application, 
> like producing a summary table stored on the local file system of the driver 
> at the end.
> If there is any existing Spark documentation that provides examples that 
> traverse the different URIs in Spark YARN client mode, or a better or smarter 
> Spark pattern or API that is more suited to this, I am happy to accept that 
> as well.  Thanks!






[jira] [Comment Edited] (SPARK-31276) Contrived working example that works with multiple URI file storages for Spark cluster mode

2020-04-03 Thread Jim Huang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074677#comment-17074677
 ] 

Jim Huang edited comment on SPARK-31276 at 4/3/20, 3:51 PM:


Clarified the question with a more succinct scenario that describes the 
challenge of a URI not being able to differentiate between the driver and the 
executors.


was (Author: jimhuang):
Clarify the question with more succinct scenario that describe the the 
challenge of URI not being able to differentiate between driver vs executor.  

> Contrived working example that works with multiple URI file storages for 
> Spark cluster mode
> ---
>
> Key: SPARK-31276
> URL: https://issues.apache.org/jira/browse/SPARK-31276
> Project: Spark
>  Issue Type: Wish
>  Components: Examples
>Affects Versions: 2.4.5
>Reporter: Jim Huang
>Priority: Major
>
> This Spark SQL Guide --> Data sources --> Generic Load/Save Functions
> [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]
> described a very simple "local file system load of an example file".  
>  
> I am looking for an example that demonstrates a workflow that exercises 
> different file systems.  For example, 
>  # Driver loads an input file from local file system
>  # Add a simple column using lit() and stores that DataFrame in cluster mode 
> to HDFS
>  # Write a small, limited subset of that DataFrame back to the driver's local 
> file system.  (This is to avoid the anti-pattern of writing a large file, 
> which is out of scope for this example.  The small DataFrame would contain 
> some basic statistics, not the complete dataset.)
>  
> The examples I found on the internet only use simple paths without explicit 
> URI prefixes.
> Without explicit URI prefixes, a "filepath" inherits how Spark was invoked, 
> local standalone vs. YARN client mode.  So a "filepath" will be read/written 
> on the local file system vs. on HDFS in cluster mode, without these explicit 
> URIs.
> There are situations where a Spark program needs to deal with both the local 
> file system and YARN client mode (big data) in the same Spark application, 
> like producing a summary table stored on the local file system of the driver 
> at the end.
> If there is any existing Spark documentation that provides examples that 
> traverse the different URIs in Spark YARN client mode, or a better or smarter 
> Spark pattern or API that is more suited to this, I am happy to accept that 
> as well.  Thanks!






[jira] [Commented] (SPARK-31276) Contrived working example that works with multiple URI file storages for Spark cluster mode

2020-04-03 Thread Jim Huang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074676#comment-17074676
 ] 

Jim Huang commented on SPARK-31276:
---

The fault is mine for not framing the scenario more succinctly.


In the scenario where the Spark process is launched / deployed in `client` mode 
on a YARN cluster:
 # What is the opposite of {{sc.parallelize(data)}}?
 ## Using a "file:///" URI writes the file "locally" on the executor nodes' 
file systems, not on the driver's local FS.  The challenge I have is that I am 
unable to use the "file:///" URI to explicitly differentiate between the driver 
and the executors.

 

I am aware this is a corner case that can be hit unintentionally when the 
dataset is too big and overflows the driver's memory.  But it is also a valid 
use case if a user just wants to store some summary statistics on the driver's 
local file system after processing a big dataset; a sketch of such a workflow 
is below.
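A hedged PySpark sketch of that workflow (paths, the HDFS host, and the summary 
logic are placeholder assumptions): read the input on the driver, do the 
distributed work against an explicit hdfs:// URI, then bring only a small 
summary back to the driver and write it with plain local I/O, which is 
effectively the opposite of {{sc.parallelize(data)}}.

{code:python}
# Hedged sketch for YARN client mode. All paths and hosts are placeholders.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("uri-example").getOrCreate()

# 1. Driver-local input: read it with plain Python/pandas on the driver and
#    parallelize it, instead of pointing executors at a file:/// path they
#    may not have.
local_pdf = pd.read_csv("/home/me/input.csv")
df = spark.createDataFrame(local_pdf).withColumn("source", F.lit("driver"))

# 2. Cluster-side storage: be explicit that this lands on HDFS.
df.write.mode("overwrite").parquet("hdfs://namenode:8020/user/me/out")

# 3. Small summary back to the driver: toPandas() pulls the rows into the
#    driver process, so ordinary local file I/O writes to the driver's file
#    system, not to the executors'.
summary_pdf = df.groupBy("source").count().toPandas()
summary_pdf.to_csv("/home/me/summary.csv", index=False)
{code}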

 

> Contrived working example that works with multiple URI file storages for 
> Spark cluster mode
> ---
>
> Key: SPARK-31276
> URL: https://issues.apache.org/jira/browse/SPARK-31276
> Project: Spark
>  Issue Type: Wish
>  Components: Examples
>Affects Versions: 2.4.5
>Reporter: Jim Huang
>Priority: Major
>
> This Spark SQL Guide --> Data sources --> Generic Load/Save Functions
> [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]
> described a very simple "local file system load of an example file".  
>  
> I am looking for an example that demonstrates a workflow that exercises 
> different file systems.  For example, 
>  # Driver loads an input file from local file system
>  # Add a simple column using lit() and stores that DataFrame in cluster mode 
> to HDFS
>  # Write a small, limited subset of that DataFrame back to the driver's local 
> file system.  (This is to avoid the anti-pattern of writing a large file, 
> which is out of scope for this example.  The small DataFrame would contain 
> some basic statistics, not the complete dataset.)
>  
> The examples I found on the internet only use simple paths without explicit 
> URI prefixes.
> Without explicit URI prefixes, a "filepath" inherits how Spark was invoked, 
> local standalone vs. YARN client mode.  So a "filepath" will be read/written 
> on the local file system vs. on HDFS in cluster mode, without these explicit 
> URIs.
> There are situations where a Spark program needs to deal with both the local 
> file system and YARN client mode (big data) in the same Spark application, 
> like producing a summary table stored on the local file system of the driver 
> at the end.
> If there is any existing Spark documentation that provides examples that 
> traverse the different URIs in Spark YARN client mode, or a better or smarter 
> Spark pattern or API that is more suited to this, I am happy to accept that 
> as well.  Thanks!






[jira] [Updated] (SPARK-31340) No call to destroy() for filter in SparkHistory

2020-04-03 Thread thierry accart (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

thierry accart updated SPARK-31340:
---
Summary: No call to destroy() for filter in SparkHistory  (was: No call to 
destroy() for authentication filter in SparkHistory)

> No call to destroy() for filter in SparkHistory
> ---
>
> Key: SPARK-31340
> URL: https://issues.apache.org/jira/browse/SPARK-31340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.5
>Reporter: thierry accart
>Priority: Major
>
> Adding the UI filter AuthenticationFilter (from Hadoop) causes the Spark 
> application to never end, because threads created by this class are not 
> interrupted.
> *To reproduce*
> Start a local spark context with hadoop-auth 3.1.0 
> {{spark.ui.enabled=true}}
> {{spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter}}
> {{#and all required ldap props}}
> {{ 
> spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.param.ldap.*=...}}
> *What's happening :*
> In [AuthenticationFilter's 
> |https://github.com/apache/hadoop/blob/branch-3.1/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/server/AuthenticationFilter.java]
>  init we have the following chain:
> {{(line 178) initializeSecretProvider(filterConfig);}}
> {{(line 209) secretProvider = constructSecretProvider(...)}}
> {{(line 237) provider.init(config, ctx, validity);}}
> If no config is specified, the provider will be 
> [RolloverSignerSecretProvider|https://github.com/apache/hadoop/blob/branch-3.1/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/util/RolloverSignerSecretProvider.java], 
> which will (line 95) start a new thread via
> {{scheduler = Executors.newSingleThreadScheduledExecutor();}}
> The created thread will be stopped in destroy() method (line 106).
> *Unfortunately, this destroy() method is not called* when SparkHistory is 
> closed, leaving threads running.
>  
> This ticket is not here to address the particular case of Hadoop's 
> authentication filter, but to ensure that any Filter added in spark.ui will 
> have its destroy() method called.
>  
>  
>  






[jira] [Created] (SPARK-31340) No call to destroy() for authentication filter in SparkHistory

2020-04-03 Thread thierry accart (Jira)
thierry accart created SPARK-31340:
--

 Summary: No call to destroy() for authentication filter in 
SparkHistory
 Key: SPARK-31340
 URL: https://issues.apache.org/jira/browse/SPARK-31340
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.5
Reporter: thierry accart


Adding the UI filter AuthenticationFilter (from Hadoop) causes the Spark 
application to never end, because threads created by this class are not 
interrupted.

*To reproduce*

Start a local spark context with hadoop-auth 3.1.0 
{{spark.ui.enabled=true}}
{{spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter}}
{{#and all required ldap props}}
{{ 
spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.param.ldap.*=...}}

*What's happening :*

In [AuthenticationFilter's 
|https://github.com/apache/hadoop/blob/branch-3.1/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/server/AuthenticationFilter.java]
 init we have the following chain:

{{(line 178) initializeSecretProvider(filterConfig);}}
{{(line 209) secretProvider = constructSecretProvider(...)}}
{{(line 237) provider.init(config, ctx, validity);}}

If no config is specified, the provider will be 
[RolloverSignerSecretProvider|https://github.com/apache/hadoop/blob/branch-3.1/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/util/RolloverSignerSecretProvider.java], 
which will (line 95) start a new thread via
{{scheduler = Executors.newSingleThreadScheduledExecutor();}}

The created thread will be stopped in destroy() method (line 106).

*Unfortunately, this destroy() method is not called* when SparkHistory is 
closed, leaving threads running.

 

This ticket is not here to address the particular case of Hadoop's 
authentication filter, but to ensure that any Filter added in spark.ui will 
have its destroy() method called.

 

 

 






[jira] [Resolved] (SPARK-31327) write spark version to avro file metadata

2020-04-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31327.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28102
[https://github.com/apache/spark/pull/28102]

> write spark version to avro file metadata
> -
>
> Key: SPARK-31327
> URL: https://issues.apache.org/jira/browse/SPARK-31327
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-30840) Add version property for ConfigEntry and ConfigBuilder

2020-04-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30840:
-
Fix Version/s: 3.0.0

> Add version property for ConfigEntry and ConfigBuilder
> --
>
> Key: SPARK-30840
> URL: https://issues.apache.org/jira/browse/SPARK-30840
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0, 3.1.0
>
>







[jira] [Updated] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()

2020-04-03 Thread Suraj (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj updated SPARK-31339:
--
Description: 
PR: [https://github.com/apache/spark/pull/28110]
 * What changes were proposed in this pull request?
 pyspark/ml/pipeline.py line 245: Change PipelineModel(...) to self.cls(...)
 * Why are the changes needed?
 This change fixes loading of a class that inherits from PipelineModel from 
file.
 E.g. Current issue:

{code:python}
class CustomPipelineModel(PipelineModel):
    def _transform(self, df):
        ...
 CustomPipelineModel.save('path/to/file') # works
 CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
instead of CustomPipelineModel()
 CustomPipelineModel.transform() # wrong: results in calling 
PipelineModel.transform() instead of CustomPipelineModel.transform(){code}
 * Does this introduce any user-facing change?
 No.

  was:
PR: [https://github.com/apache/spark/pull/28110]

What changes were proposed in this pull request?
 pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...)

Why are the changes needed?
 This change fixes the loading of class (which inherits from PipelineModel 
class) from file.
 E.g. Current issue:
{code:java}
CustomPipelineModel(PipelineModel):
    def _transform(self, df):
        ...
 CustomPipelineModel.save('path/to/file') # works
 CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
instead of CustomPipelineModel()
 CustomPipelineModel.transform() # wrong: results in calling 
PipelineModel.transform() instead of CustomPipelineModel.transform(){code}
Does this introduce any user-facing change?
 No.


> Changed PipelineModel(...) to self.cls(...) in 
> pyspark.ml.pipeline.PipelineModelReader.load()
> -
>
> Key: SPARK-31339
> URL: https://issues.apache.org/jira/browse/SPARK-31339
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Suraj
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> PR: [https://github.com/apache/spark/pull/28110]
>  * What changes were proposed in this pull request?
>  pyspark/ml/pipeline.py line 245: Change PipelineModel(...) to self.cls(...)
>  * Why are the changes needed?
>  This change fixes loading of a class that inherits from PipelineModel from 
> file.
>  E.g. Current issue:
> {code:python}
> class CustomPipelineModel(PipelineModel):
>     def _transform(self, df):
>         ...
>  CustomPipelineModel.save('path/to/file') # works
>  CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
> instead of CustomPipelineModel()
>  CustomPipelineModel.transform() # wrong: results in calling 
> PipelineModel.transform() instead of CustomPipelineModel.transform(){code}
>  * Does this introduce any user-facing change?
>  No.






[jira] [Updated] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()

2020-04-03 Thread Suraj (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj updated SPARK-31339:
--
Description: 
PR: [https://github.com/apache/spark/pull/28110]

What changes were proposed in this pull request?
 pyspark/ml/pipeline.py line 245: Change PipelineModel(...) to self.cls(...)

Why are the changes needed?
 This change fixes loading of a class that inherits from PipelineModel from 
file.
 E.g. Current issue:
{code:python}
class CustomPipelineModel(PipelineModel):
    def _transform(self, df):
        ...
 CustomPipelineModel.save('path/to/file') # works
 CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
instead of CustomPipelineModel()
 CustomPipelineModel.transform() # wrong: results in calling 
PipelineModel.transform() instead of CustomPipelineModel.transform(){code}
Does this introduce any user-facing change?
 No.

  was:
PR: [https://github.com/apache/spark/pull/28110]

What changes were proposed in this pull request?
 pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...)

Why are the changes needed?
 This change fixes the loading of class (which inherits from PipelineModel 
class) from file.
 E.g. Current issue:
 ```
 CustomPipelineModel(PipelineModel):
 def _transform(self, df):
 ...
 CustomPipelineModel.save('path/to/file') # works
 CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
instead of CustomPipelineModel()
 CustomPipelineModel.transform() # wrong: results in calling 
PipelineModel.transform() instead of CustomPipelineModel.transform()
 ```

Does this introduce any user-facing change?
 No.


> Changed PipelineModel(...) to self.cls(...) in 
> pyspark.ml.pipeline.PipelineModelReader.load()
> -
>
> Key: SPARK-31339
> URL: https://issues.apache.org/jira/browse/SPARK-31339
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Suraj
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> PR: [https://github.com/apache/spark/pull/28110]
> What changes were proposed in this pull request?
>  pyspark/ml/pipeline.py line 245: Change PipelineModel(...) to self.cls(...)
> Why are the changes needed?
>  This change fixes loading of a class that inherits from PipelineModel from 
> file.
>  E.g. Current issue:
> {code:python}
> class CustomPipelineModel(PipelineModel):
>     def _transform(self, df):
>         ...
>  CustomPipelineModel.save('path/to/file') # works
>  CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
> instead of CustomPipelineModel()
>  CustomPipelineModel.transform() # wrong: results in calling 
> PipelineModel.transform() instead of CustomPipelineModel.transform(){code}
> Does this introduce any user-facing change?
>  No.






[jira] [Updated] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()

2020-04-03 Thread Suraj (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj updated SPARK-31339:
--
Description: 
PR: [https://github.com/apache/spark/pull/28110]

What changes were proposed in this pull request?
 pyspark/ml/pipeline.py line 245: Change PipelineModel(...) to self.cls(...)

Why are the changes needed?
 This change fixes loading of a class that inherits from PipelineModel from 
file.
 E.g. Current issue:
 ```
 class CustomPipelineModel(PipelineModel):
     def _transform(self, df):
         ...
 CustomPipelineModel.save('path/to/file') # works
 CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
instead of CustomPipelineModel()
 CustomPipelineModel.transform() # wrong: results in calling 
PipelineModel.transform() instead of CustomPipelineModel.transform()
 ```

Does this introduce any user-facing change?
 No.

  was:
PR: [https://github.com/apache/spark/pull/28110]

### What changes were proposed in this pull request?
pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...)

### Why are the changes needed?
This change fixes the loading of class (which inherits from PipelineModel 
class) from file.
E.g. Current issue:
```
CustomPipelineModel(PipelineModel):
 def _transform(self, df):
 ...
CustomPipelineModel.save('path/to/file') # works
CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
instead of CustomPipelineModel()
CustomPipelineModel.transform() # wrong: results in calling 
PipelineModel.transform() instead of CustomPipelineModel.transform()
```

### Does this introduce any user-facing change?
No.


> Changed PipelineModel(...) to self.cls(...) in 
> pyspark.ml.pipeline.PipelineModelReader.load()
> -
>
> Key: SPARK-31339
> URL: https://issues.apache.org/jira/browse/SPARK-31339
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Suraj
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> PR: [https://github.com/apache/spark/pull/28110]
> What changes were proposed in this pull request?
>  pyspark/ml/pipeline.py line 245: Change PipelineModel(...) to self.cls(...)
> Why are the changes needed?
>  This change fixes loading of a class that inherits from PipelineModel from 
> file.
>  E.g. Current issue:
>  ```
>  class CustomPipelineModel(PipelineModel):
>      def _transform(self, df):
>          ...
>  CustomPipelineModel.save('path/to/file') # works
>  CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
> instead of CustomPipelineModel()
>  CustomPipelineModel.transform() # wrong: results in calling 
> PipelineModel.transform() instead of CustomPipelineModel.transform()
>  ```
> Does this introduce any user-facing change?
>  No.






[jira] [Created] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()

2020-04-03 Thread Suraj (Jira)
Suraj created SPARK-31339:
-

 Summary: Changed PipelineModel(...) to self.cls(...) in 
pyspark.ml.pipeline.PipelineModelReader.load()
 Key: SPARK-31339
 URL: https://issues.apache.org/jira/browse/SPARK-31339
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.4.5
Reporter: Suraj


PR: [https://github.com/apache/spark/pull/28110]

### What changes were proposed in this pull request?
pyspark/ml/pipeline.py line 245: Change PipelineModel(...) to self.cls(...)

### Why are the changes needed?
This change fixes loading of a class that inherits from PipelineModel from 
file.
E.g. Current issue:
```
class CustomPipelineModel(PipelineModel):
    def _transform(self, df):
        ...
CustomPipelineModel.save('path/to/file') # works
CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() 
instead of CustomPipelineModel()
CustomPipelineModel.transform() # wrong: results in calling 
PipelineModel.transform() instead of CustomPipelineModel.transform()
```

### Does this introduce any user-facing change?
No.
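For context, a paraphrased sketch of what the proposed change looks like inside 
the reader is below; this is simplified and renamed, not the exact upstream 
code, so see the PR for the real diff.

{code:python}
# Paraphrased sketch of the proposed fix in PipelineModelReader.load():
# construct the class the reader was created for (self.cls) instead of
# hard-coding PipelineModel, so subclasses survive a save/load round trip.
from pyspark.ml.pipeline import PipelineSharedReadWrite
from pyspark.ml.util import DefaultParamsReader, MLReader


class PipelineModelReaderSketch(MLReader):
    def __init__(self, cls):
        super(PipelineModelReaderSketch, self).__init__()
        # e.g. CustomPipelineModel when called via CustomPipelineModel.load(path)
        self.cls = cls

    def load(self, path):
        metadata = DefaultParamsReader.loadMetadata(path, self.sc)
        uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)
        # Previously this returned PipelineModel(stages=stages); constructing
        # self.cls preserves the caller's subclass.
        return self.cls(stages=stages)._resetUid(uid)
{code}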






[jira] [Updated] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key.

2020-04-03 Thread Mohit Dave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Dave updated SPARK-31338:
---
Description: 
h2. *Our Use-case Details:*

While reading from a JDBC source using Spark SQL, we use the following read 
API:

jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties).
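(For illustration, the equivalent partitioned read in PySpark looks roughly like 
the hedged sketch below; the URL, credentials, and bounds are placeholders.)

{code:python}
# Hedged sketch of the partitioned JDBC read described above.
# Connection details and bounds are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

df = spark.read.jdbc(
    url="jdbc:postgresql://dbhost:5432/postgres",
    table="public.lineitem_sf1000",
    column="l_orderkey",      # partition column
    lowerBound=1,
    upperBound=6000000000,    # placeholder upper bound on l_orderkey
    numPartitions=16,         # one generated query per partition
    properties={"user": "postgres", "password": "secret"},
)
# Spark turns this into 16 parallel queries, each with a WHERE clause on
# l_orderkey; one of them also carries an "OR l_orderkey IS NULL" check,
# which is what this ticket proposes to avoid for NOT NULL columns.
print(df.rdd.getNumPartitions())
{code}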

*Table definition:*
 postgres=> \d lineitem_sf1000
 Table "public.lineitem_sf1000"
 Column | Type | Modifiers
 -++--
 *l_orderkey | bigint | not null*
 l_partkey | bigint | not null
 l_suppkey | bigint | not null
 l_linenumber | bigint | not null
 l_quantity | numeric(10,2) | not null
 l_extendedprice | numeric(10,2) | not null
 l_discount | numeric(10,2) | not null
 l_tax | numeric(10,2) | not null
 l_returnflag | character varying(1) | not null
 l_linestatus | character varying(1) | not null
 l_shipdate | character varying(29) | not null
 l_commitdate | character varying(29) | not null
 l_receiptdate | character varying(29) | not null
 l_shipinstruct | character varying(25) | not null
 l_shipmode | character varying(10) | not null
 l_comment | character varying(44) | not null
 Indexes:
 "l_order_sf1000_idx" btree (l_orderkey)

 

*Partition column*: l_orderkey

*numPartitions*: 16
h2. *Problem details:*

 
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
 FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND 
l_orderkey < 187501 {code}
15 of the 16 queries are generated with range predicates like the one above. 
The last query looks like this:
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
 FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or 
l_orderkey is null {code}
*In the last query, we are trying to get the remaining records, along with any 
rows whose partition key is NULL.*

This hurts performance badly. While the first 15 SQLs took approximately 10 
minutes to execute, the last SQL with the NULL check takes 45 minutes because 
it has to evaluate a second scan (OR clause) of the table for NULL values of 
the partition key.

*Note that I have defined the partition key of the table as NOT NULL at the 
database. Therefore, the SQL for the last partition does not need this NULL 
check; Spark SQL should be able to avoid such a condition, and this Jira is 
intended to fix this behavior.*
 

 

  was:
*Our Use-case Details:*

While reading from a jdbc source using spark sql, we are using below read 
format :

jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties).

 

Table defination : 
postgres=> \d lineitem_sf1000
 Table "public.lineitem_sf1000"
 Column | Type | Modifiers
-+---+---
 l_orderkey | bigint | not null
 l_partkey | bigint | not null
 l_suppkey | bigint | not null
 l_linenumber | bigint | not null
 l_quantity | numeric(10,2) | not null
 l_extendedprice | numeric(10,2) | not null
 l_discount | numeric(10,2) | not null
 l_tax | numeric(10,2) | not null
 l_returnflag | character varying(1) | not null
 l_linestatus | character varying(1) | not null
 l_shipdate | character varying(29) | not null
 l_commitdate | character varying(29) | not null
 l_receiptdate | character varying(29) | not null
 l_shipinstruct | character varying(25) | not null
 l_shipmode | character varying(10) | not null
 l_comment | character varying(44) | not null
Indexes:
 "l_order_sf1000_idx" btree (l_orderkey) 

 

Partition column : l_orderkey 

numpartion : 16

 

*Problem details :* 

 
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
 

[jira] [Updated] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key.

2020-04-03 Thread Mohit Dave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Dave updated SPARK-31338:
---
Description: 
*Our Use-case Details:*

While reading from a jdbc source using spark sql, we are using below read 
format :

jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties).

 

Table definition:
postgres=> \d lineitem_sf1000
 Table "public.lineitem_sf1000"
 Column | Type | Modifiers
-+---+---
 l_orderkey | bigint | not null
 l_partkey | bigint | not null
 l_suppkey | bigint | not null
 l_linenumber | bigint | not null
 l_quantity | numeric(10,2) | not null
 l_extendedprice | numeric(10,2) | not null
 l_discount | numeric(10,2) | not null
 l_tax | numeric(10,2) | not null
 l_returnflag | character varying(1) | not null
 l_linestatus | character varying(1) | not null
 l_shipdate | character varying(29) | not null
 l_commitdate | character varying(29) | not null
 l_receiptdate | character varying(29) | not null
 l_shipinstruct | character varying(25) | not null
 l_shipmode | character varying(10) | not null
 l_comment | character varying(44) | not null
Indexes:
 "l_order_sf1000_idx" btree (l_orderkey) 

 

Partition column : l_orderkey 

numPartitions: 16

 

*Problem details :* 

 
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
 FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND 
l_orderkey < 187501 {code}
15 of the 16 queries are generated with range predicates like the one above. 
The last query looks like this:
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
 FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or 
l_orderkey is null {code}
*In the last query, we are trying to get the remaining records, along with any 
rows whose partition key is NULL.*

This hurts performance badly. While the first 15 SQLs took approximately 10 
minutes to execute, the last SQL with the NULL check takes 45 minutes because 
it has to evaluate a second scan (OR clause) of the table for NULL values of 
the partition key.

*Note that I have defined the partition key of the table as NOT NULL at the 
database. Therefore, the SQL for the last partition does not need this NULL 
check; Spark SQL should be able to avoid such a condition, and this Jira is 
intended to fix this behavior.*
 

 

  was:
*Our Use-case Details:*

While reading from a jdbc source using spark sql, we are using below read 
format :

jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties).

 

Table defination : 
postgres=> \d lineitem_sf1000
   Table "public.lineitem_sf1000" Column  | Type
  | Modifiers
-+---+---
 l_orderkey  | bigint| not null l_partkey   | bigint
| not null l_suppkey   | bigint| not null 
l_linenumber| bigint| not null l_quantity  | 
numeric(10,2) | not null l_extendedprice | numeric(10,2) | not 
null l_discount  | numeric(10,2) | not null l_tax   | 
numeric(10,2) | not null l_returnflag| character varying(1)  | not 
null l_linestatus| character varying(1)  | not null l_shipdate  | 
character varying(29) | not null l_commitdate| character varying(29) | not 
null l_receiptdate   | character varying(29) | not null l_shipinstruct  | 
character varying(25) | not null l_shipmode  | character varying(10) | not 
null l_comment   | character varying(44) | not nullIndexes:
"l_order_sf1000_idx" btree (l_orderkey) 

  

Partition column : l_orderkey 

numpartion : 16

 

*Problem details :* 

 
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 

[jira] [Updated] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key.

2020-04-03 Thread Mohit Dave (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Dave updated SPARK-31338:
---
Description: 
*Our Use-case Details:*

While reading from a jdbc source using spark sql, we are using below read 
format :

jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties).

 

Table definition:
postgres=> \d lineitem_sf1000
 Table "public.lineitem_sf1000"
 Column | Type | Modifiers
-+-+-
 l_orderkey | bigint | not null
 l_partkey | bigint | not null
 l_suppkey | bigint | not null
 l_linenumber | bigint | not null
 l_quantity | numeric(10,2) | not null
 l_extendedprice | numeric(10,2) | not null
 l_discount | numeric(10,2) | not null
 l_tax | numeric(10,2) | not null
 l_returnflag | character varying(1) | not null
 l_linestatus | character varying(1) | not null
 l_shipdate | character varying(29) | not null
 l_commitdate | character varying(29) | not null
 l_receiptdate | character varying(29) | not null
 l_shipinstruct | character varying(25) | not null
 l_shipmode | character varying(10) | not null
 l_comment | character varying(44) | not null
Indexes:
 "l_order_sf1000_idx" btree (l_orderkey)

  

Partition column: l_orderkey

numPartitions: 16

 

*Problem details :* 

 
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
 FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND 
l_orderkey < 187501 {code}
15 of the 16 queries are generated with range predicates like the one above. 
The last query looks like this:
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
 FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or 
l_orderkey is null {code}
*In the last query, we are trying to get the remaining records, along with any 
rows whose partition key is NULL.*

This hurts performance badly. While the first 15 SQLs took approximately 10 
minutes to execute, the last SQL with the NULL check takes 45 minutes because 
it has to evaluate a second scan (OR clause) of the table for NULL values of 
the partition key.

*Note that I have defined the partition key of the table as NOT NULL at the 
database. Therefore, the SQL for the last partition does not need this NULL 
check; Spark SQL should be able to avoid such a condition, and this Jira is 
intended to fix this behavior.*
 

 

  was:
*Our Use-case Details:*

While reading from a jdbc source using spark sql, we are using below read 
format :

jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties).

 

Table defination : 
postgres=> \d lineitem_sf1000
          Table "public.lineitem_sf1000"
     Column      |         Type          | Modifiers
-----------------+-----------------------+-----------
 l_orderkey      | bigint                | not null
 l_partkey       | bigint                | not null
 l_suppkey       | bigint                | not null
 l_linenumber    | bigint                | not null
 l_quantity      | numeric(10,2)         | not null
 l_extendedprice | numeric(10,2)         | not null
 l_discount      | numeric(10,2)         | not null
 l_tax           | numeric(10,2)         | not null
 l_returnflag    | character varying(1)  | not null
 l_linestatus    | character varying(1)  | not null
 l_shipdate      | character varying(29) | not null
 l_commitdate    | character varying(29) | not null
 l_receiptdate   | character varying(29) | not null
 l_shipinstruct  | character varying(25) | not null
 l_shipmode      | character varying(10) | not null
 l_comment       | character varying(44) | not null
Indexes:
    "l_order_sf1000_idx" btree (l_orderkey)
 

Partition column: l_orderkey

numPartitions: 16

 

*Problem details :* 

 
{code:java}
SELECT 

[jira] [Created] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor NOT NULL table definition of partition key.

2020-04-03 Thread Mohit Dave (Jira)
Mohit Dave created SPARK-31338:
--

 Summary: Spark SQL JDBC Data Source partitioned read : Spark SQL 
does not honor NOT NULL table definition of partition key.
 Key: SPARK-31338
 URL: https://issues.apache.org/jira/browse/SPARK-31338
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.5
Reporter: Mohit Dave


*Our Use-case Details:*

While reading from a jdbc source using spark sql, we are using below read 
format :

jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
upperBound: Long, numPartitions: Int, connectionProperties: Properties).

 

Table definition:
postgres=> \d lineitem_sf1000
          Table "public.lineitem_sf1000"
     Column      |         Type          | Modifiers
-----------------+-----------------------+-----------
 l_orderkey      | bigint                | not null
 l_partkey       | bigint                | not null
 l_suppkey       | bigint                | not null
 l_linenumber    | bigint                | not null
 l_quantity      | numeric(10,2)         | not null
 l_extendedprice | numeric(10,2)         | not null
 l_discount      | numeric(10,2)         | not null
 l_tax           | numeric(10,2)         | not null
 l_returnflag    | character varying(1)  | not null
 l_linestatus    | character varying(1)  | not null
 l_shipdate      | character varying(29) | not null
 l_commitdate    | character varying(29) | not null
 l_receiptdate   | character varying(29) | not null
 l_shipinstruct  | character varying(25) | not null
 l_shipmode      | character varying(10) | not null
 l_comment       | character varying(44) | not null
Indexes:
    "l_order_sf1000_idx" btree (l_orderkey)
 

Partition column: l_orderkey

numPartitions: 16

 

*Problem details :* 

 
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
 FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND 
l_orderkey < 187501 {code}
15 queries are generated with range predicates like the one above. The last 
query looks like this:
{code:java}
SELECT 
"l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag"
 FROM (SELECT 
l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
 FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or 
l_orderkey is null {code}
*In the last query, we are trying to get the remaining records, along with any 
rows where the partition key is NULL.*

This hurts performance badly. While the first 15 SQLs took approximately 10 
minutes to execute, the last SQL with the NULL check takes 45 minutes, because 
it has to evaluate a second scan (the OR clause) of the table for NULL values of 
the partition key.

*Note that I have defined the partition key of the table as NOT NULL at the 
database. Therefore, the SQL for the last partition does not need this NULL 
check; Spark SQL should be able to avoid such a condition, and this Jira is 
intended to fix that behavior.*
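
For reference, a minimal sketch of the partitioned read call that produces the 
queries above. All connection details and bounds here are placeholders (not our 
actual values); only the jdbc() signature matches the one quoted earlier.
{code:java}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioned-read").getOrCreate()

val props = new Properties()
props.setProperty("user", "postgres")                 // placeholder
props.setProperty("password", "secret")               // placeholder
props.setProperty("driver", "org.postgresql.Driver")

// Spark splits [lowerBound, upperBound) on l_orderkey into numPartitions ranges
// and issues one query per range; the open-ended partition additionally gets an
// "OR l_orderkey IS NULL" predicate, which is the behaviour this issue targets.
val df = spark.read.jdbc(
  url = "jdbc:postgresql://dbhost:5432/tpch",          // placeholder URL
  table = "public.lineitem_sf1000",
  columnName = "l_orderkey",
  lowerBound = 1L,                                     // placeholder bound
  upperBound = 600000000L,                             // placeholder bound
  numPartitions = 16,
  connectionProperties = props)

df.count()
{code}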
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2020-04-03 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074434#comment-17074434
 ] 

Wenchen Fan commented on SPARK-25102:
-

I'd like to propose backporting it to 2.4. It's very important to have version 
info in the file metadata in order to implement backward compatibility. It's 
unfortunate that we started this so late, but it still helps if Spark 2.4.6 
starts to do it, as we will maintain the 2.4 line for a long time.

Any thoughts?
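
As a side note, here is a rough sketch (the file path is a placeholder) of how 
one can check whether a given Parquet file already carries this key, using 
parquet-mr's ParquetFileReader directly:
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder path; this reads only the footer, not the data pages.
val inputFile = HadoopInputFile.fromPath(
  new Path("/tmp/p/part-00000.snappy.parquet"), new Configuration())
val reader = ParquetFileReader.open(inputFile)
try {
  val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
  // null if the file was written by a Spark version that does not add the key
  println(kv.get("org.apache.spark.sql.create.version"))
} finally {
  reader.close()
}
{code}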

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata with 
> `org.apache.spark.sql.create.version`. It's different from Hive Table 
> property key `spark.sql.create.version`. It seems that we cannot change that 
> for backward compatibility (even in Apache Spark 3.0)
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31272) Support DB2 Kerberos login in JDBC connector

2020-04-03 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074421#comment-17074421
 ] 

Gabor Somogyi commented on SPARK-31272:
---

The code is ready, but it depends on the MariaDB PR. I intend to file a PR 
once the MariaDB one is ready...

> Support DB2 Kerberos login in JDBC connector
> 
>
> Key: SPARK-31272
> URL: https://issues.apache.org/jira/browse/SPARK-31272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31336) Support Oracle Kerberos login in JDBC connector

2020-04-03 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-31336:
-

 Summary: Support Oracle Kerberos login in JDBC connector
 Key: SPARK-31336
 URL: https://issues.apache.org/jira/browse/SPARK-31336
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Gabor Somogyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31337) Support MS Sql Kerberos login in JDBC connector

2020-04-03 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-31337:
-

 Summary: Support MS Sql Kerberos login in JDBC connector
 Key: SPARK-31337
 URL: https://issues.apache.org/jira/browse/SPARK-31337
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Gabor Somogyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074390#comment-17074390
 ] 

Ismaël Mejía commented on SPARK-31330:
--

Ah, thanks for letting me know about the mail; I did not reply-all. I think 
Hyukjin already forwarded it, so we should be good. Don't hesitate to ping me in 
the PR or in the INFRA ticket if you need a reference or some help.

> Automatically label PRs based on the paths they touch
> -
>
> Key: SPARK-31330
> URL: https://issues.apache.org/jira/browse/SPARK-31330
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We can potentially leverage the added labels to drive testing, review, or 
> other project tooling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31334) Use agg column in Having clause behave different with column type

2020-04-03 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-31334:
--
Comment: was deleted

(was: cc [~cloud_fan] [~yumwang] )

> Use agg column in Having clause behave different with column type 
> --
>
> Key: SPARK-31334
> URL: https://issues.apache.org/jira/browse/SPARK-31334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> ```
> test("") {
> Seq(
>   (1, 3),
>   (2, 3),
>   (3, 6),
>   (4, 7),
>   (5, 9),
>   (6, 9)
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> [info] -  *** FAILED *** (508 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 
> missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, 
> sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. 
> Attribute(s) with the same name appear in the operation: a. Please check if 
> the right attribute(s) are used.;;
> [info] Project [b#181, a#184]
> [info] +- Filter (sum(a#184)#188 > cast(3 as double))
> [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, 
> sum(a#184) AS sum(a#184)#188]
> [info]   +- SubqueryAlias `testdata`
> [info]  +- Project [_1#177 AS a#180, _2#178 AS b#181]
> [info] +- LocalRelation [_1#177, _2#178]
> ```
> ```
> test("") {
> Seq(
>   ("1", "3"),
>   ("2", "3"),
>   ("3", "6"),
>   ("4", "7"),
>   ("5", "9"),
>   ("6", "9")
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> == Physical Plan ==
> *(2) Project [b#181, a#184L]
> +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 
> as bigint))#197L > 3))
>+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
>   +- Exchange hashpartitioning(b#181, 5)
>  +- *(1) HashAggregate(keys=[b#181], 
> functions=[partial_sum(cast(a#180 as bigint))])
> +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
>+- LocalTableScan [_1#177, _2#178]
> ```{code}
> Spend A lot of time I can't find witch analyzer make this different,
> When column type is double, it failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31334) Use agg column in Having clause behave different with column type

2020-04-03 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074381#comment-17074381
 ] 

angerszhu commented on SPARK-31334:
---

I have found the reason. In the analyzer, when the logical plan
{code:java}
'Filter ('sum('a) > 3)
+- Aggregate [b#181], [b#181, sum(a#180) AS a#184L]
   +- SubqueryAlias `testdata`
  +- Project [_1#177 AS a#180, _2#178 AS b#181]
 +- LocalRelation [_1#177, _2#178]
{code}
comes into ResolveAggregateFunctions, the aggregate's expressions are still 
unresolved because `a` is of String type, so ResolveAggregateFunctions does not 
change the plan above. The `sum(a)` in the Filter condition is then resolved by 
ResolveReferences, where `a` is resolved to the aggregate's output column `a`, 
and that is where the error happens.
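
As a workaround (my own sketch, not a fix), filtering on the alias from an outer 
query avoids re-resolving sum(a) in the Filter; it assumes the testData temp view 
from the snippets in the issue:
{code:java}
// Workaround sketch: aggregate in a subquery and filter on the alias, so the
// HAVING-style predicate never has to re-resolve sum(a) against the aggregate.
val y = sql(
  """
    | SELECT b, a
    | FROM (
    |   SELECT b, sum(a) AS a
    |   FROM testData
    |   GROUP BY b
    | ) t
    | WHERE a > 3
  """.stripMargin)
y.show()
{code}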

> Use agg column in Having clause behave different with column type 
> --
>
> Key: SPARK-31334
> URL: https://issues.apache.org/jira/browse/SPARK-31334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> ```
> test("") {
> Seq(
>   (1, 3),
>   (2, 3),
>   (3, 6),
>   (4, 7),
>   (5, 9),
>   (6, 9)
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> [info] -  *** FAILED *** (508 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 
> missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, 
> sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. 
> Attribute(s) with the same name appear in the operation: a. Please check if 
> the right attribute(s) are used.;;
> [info] Project [b#181, a#184]
> [info] +- Filter (sum(a#184)#188 > cast(3 as double))
> [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, 
> sum(a#184) AS sum(a#184)#188]
> [info]   +- SubqueryAlias `testdata`
> [info]  +- Project [_1#177 AS a#180, _2#178 AS b#181]
> [info] +- LocalRelation [_1#177, _2#178]
> ```
> ```
> test("") {
> Seq(
>   ("1", "3"),
>   ("2", "3"),
>   ("3", "6"),
>   ("4", "7"),
>   ("5", "9"),
>   ("6", "9")
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> == Physical Plan ==
> *(2) Project [b#181, a#184L]
> +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 
> as bigint))#197L > 3))
>+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
>   +- Exchange hashpartitioning(b#181, 5)
>  +- *(1) HashAggregate(keys=[b#181], 
> functions=[partial_sum(cast(a#180 as bigint))])
> +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
>+- LocalTableScan [_1#177, _2#178]
> ```{code}
> Spend A lot of time I can't find witch analyzer make this different,
> When column type is double, it failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception

2020-04-03 Thread philipse (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074361#comment-17074361
 ] 

philipse commented on SPARK-18681:
--

[~michael] Any news on this issue? I hit the same issue on Spark 2.4.5.
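
In case it helps while this is open, here is a sketch of the workaround named in 
the error message quoted below; it disables Spark-managed file source partitions 
and therefore degrades performance, as the message warns:
{code:java}
import org.apache.spark.sql.SparkSession

// Workaround from the error message below: turn off Spark-managed file source
// partitions so partition metadata is not fetched by filter from the metastore.
// This can noticeably slow down queries on tables with many partitions.
val spark = SparkSession.builder()
  .appName("partition-filter-workaround")
  .config("spark.sql.hive.manageFilesourcePartitions", "false")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT * FROM test WHERE part = 1 LIMIT 10").show()
{code}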

> Throw Filtering is supported only on partition keys of type string exception
> 
>
> Key: SPARK-18681
> URL: https://issues.apache.org/jira/browse/SPARK-18681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.1.0
>
>
> Cloudera put 
> {{/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml}}
>  as the configuration file for the Hive Metastore Server, where 
> {{hive.metastore.try.direct.sql=false}}. But Spark reading the gateway 
> configuration file and get default value 
> {{hive.metastore.try.direct.sql=true}}. we should use {{getMetaConf}} or 
> {{getMSC.getConfigValue}} method to obtain the original configuration from 
> Hive Metastore Server.
> {noformat}
> spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
> Time taken: 0.221 seconds
> spark-sql> select * from test where part=1 limit 10;
> 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
> test where part=1 limit 10]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
>   at 
> 

[jira] [Created] (SPARK-31335) Add try function support

2020-04-03 Thread Kent Yao (Jira)
Kent Yao created SPARK-31335:


 Summary: Add try function support
 Key: SPARK-31335
 URL: https://issues.apache.org/jira/browse/SPARK-31335
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao



{code:java}
Evaluate an expression and handle certain types of execution errors by 
returning NULL.
In cases where it is preferable for queries to produce NULL instead of failing 
when corrupt or invalid data is encountered, the TRY function may be useful, 
especially when ANSI mode is on and users need null tolerance on certain 
columns or outputs.

AnalysisExceptions will not be handled by this; the errors typically handled by 
the TRY function are:

  * Division by zero
  * Invalid casting
  * Numeric value out of range
  * etc.
{code}
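
For comparison, a rough sketch (my own example, not part of the proposal) of the 
kind of guard users write today to get NULL instead of a division-by-zero 
failure; the proposed TRY function would make such guards unnecessary. Assumes a 
spark-shell session where spark and spark.implicits._ are in scope.
{code:java}
// Today's workaround for null-on-error semantics, shown for division by zero:
// explicitly return NULL when the divisor is 0 instead of letting the query fail.
import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._

val df = Seq((10, 2), (7, 0)).toDF("a", "b")
val safeDiv = when(col("b") === 0, lit(null)).otherwise(col("a") / col("b"))
df.select(col("a"), col("b"), safeDiv.as("a_div_b")).show()
{code}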




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31249) Flaky Test: CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied

2020-04-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31249.
--
Fix Version/s: 3.0.0
 Assignee: wuyi
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28100

> Flaky Test: CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is 
> applied
> -
>
> Key: SPARK-31249
> URL: https://issues.apache.org/jira/browse/SPARK-31249
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120302/testReport/
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did 
> not equal 3
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.$anonfun$new$11(CoarseGrainedSchedulerBackendSuite.scala:186)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31334) Use agg column in Having clause behave different with column type

2020-04-03 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074296#comment-17074296
 ] 

angerszhu commented on SPARK-31334:
---

cc [~cloud_fan] [~yumwang] 

> Use agg column in Having clause behave different with column type 
> --
>
> Key: SPARK-31334
> URL: https://issues.apache.org/jira/browse/SPARK-31334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> ```
> test("") {
> Seq(
>   (1, 3),
>   (2, 3),
>   (3, 6),
>   (4, 7),
>   (5, 9),
>   (6, 9)
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> [info] -  *** FAILED *** (508 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 
> missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, 
> sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. 
> Attribute(s) with the same name appear in the operation: a. Please check if 
> the right attribute(s) are used.;;
> [info] Project [b#181, a#184]
> [info] +- Filter (sum(a#184)#188 > cast(3 as double))
> [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, 
> sum(a#184) AS sum(a#184)#188]
> [info]   +- SubqueryAlias `testdata`
> [info]  +- Project [_1#177 AS a#180, _2#178 AS b#181]
> [info] +- LocalRelation [_1#177, _2#178]
> ```
> ```
> test("") {
> Seq(
>   ("1", "3"),
>   ("2", "3"),
>   ("3", "6"),
>   ("4", "7"),
>   ("5", "9"),
>   ("6", "9")
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> == Physical Plan ==
> *(2) Project [b#181, a#184L]
> +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 
> as bigint))#197L > 3))
>+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
>   +- Exchange hashpartitioning(b#181, 5)
>  +- *(1) HashAggregate(keys=[b#181], 
> functions=[partial_sum(cast(a#180 as bigint))])
> +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
>+- LocalTableScan [_1#177, _2#178]
> ```{code}
> Spend A lot of time I can't find witch analyzer make this different,
> When column type is double, it failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31334) Use agg column in Having clause behave different with column type

2020-04-03 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-31334:
--
Description: 
{code:java}
```

test("") {
Seq(
  (1, 3),
  (2, 3),
  (3, 6),
  (4, 7),
  (5, 9),
  (6, 9)
).toDF("a", "b").createOrReplaceTempView("testData")

val x = sql(
  """
| SELECT b, sum(a) as a
| FROM testData
| GROUP BY b
| HAVING sum(a) > 3
  """.stripMargin)

x.explain()
x.show()
  }

[info] -  *** FAILED *** (508 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 
missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, sum(cast(a#180 
as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. Attribute(s) with the same 
name appear in the operation: a. Please check if the right attribute(s) are 
used.;;
[info] Project [b#181, a#184]
[info] +- Filter (sum(a#184)#188 > cast(3 as double))
[info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, 
sum(a#184) AS sum(a#184)#188]
[info]   +- SubqueryAlias `testdata`
[info]  +- Project [_1#177 AS a#180, _2#178 AS b#181]
[info] +- LocalRelation [_1#177, _2#178]
```
```
test("") {
Seq(
  ("1", "3"),
  ("2", "3"),
  ("3", "6"),
  ("4", "7"),
  ("5", "9"),
  ("6", "9")
).toDF("a", "b").createOrReplaceTempView("testData")

val x = sql(
  """
| SELECT b, sum(a) as a
| FROM testData
| GROUP BY b
| HAVING sum(a) > 3
  """.stripMargin)

x.explain()
x.show()
  }


== Physical Plan ==
*(2) Project [b#181, a#184L]
+- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 
as bigint))#197L > 3))
   +- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
  +- Exchange hashpartitioning(b#181, 5)
 +- *(1) HashAggregate(keys=[b#181], functions=[partial_sum(cast(a#180 
as bigint))])
+- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
   +- LocalTableScan [_1#177, _2#178]
```{code}
I spent a lot of time but could not find which analyzer rule causes this difference.

When the column type is double, it fails.

> Use agg column in Having clause behave different with column type 
> --
>
> Key: SPARK-31334
> URL: https://issues.apache.org/jira/browse/SPARK-31334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> ```
> test("") {
> Seq(
>   (1, 3),
>   (2, 3),
>   (3, 6),
>   (4, 7),
>   (5, 9),
>   (6, 9)
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> [info] -  *** FAILED *** (508 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 
> missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, 
> sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. 
> Attribute(s) with the same name appear in the operation: a. Please check if 
> the right attribute(s) are used.;;
> [info] Project [b#181, a#184]
> [info] +- Filter (sum(a#184)#188 > cast(3 as double))
> [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, 
> sum(a#184) AS sum(a#184)#188]
> [info]   +- SubqueryAlias `testdata`
> [info]  +- Project [_1#177 AS a#180, _2#178 AS b#181]
> [info] +- LocalRelation [_1#177, _2#178]
> ```
> ```
> test("") {
> Seq(
>   ("1", "3"),
>   ("2", "3"),
>   ("3", "6"),
>   ("4", "7"),
>   ("5", "9"),
>   ("6", "9")
> ).toDF("a", "b").createOrReplaceTempView("testData")
> val x = sql(
>   """
> | SELECT b, sum(a) as a
> | FROM testData
> | GROUP BY b
> | HAVING sum(a) > 3
>   """.stripMargin)
> x.explain()
> x.show()
>   }
> == Physical Plan ==
> *(2) Project [b#181, a#184L]
> +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 
> as bigint))#197L > 3))
>+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))])
>   +- Exchange hashpartitioning(b#181, 5)
>  +- *(1) HashAggregate(keys=[b#181], 
> functions=[partial_sum(cast(a#180 as bigint))])
> +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181]
>+- LocalTableScan [_1#177, _2#178]
> ```{code}
> Spend A lot of time I can't find witch analyzer make this different,
> When column type is double, it failed.



--
This message was sent by 

[jira] [Created] (SPARK-31334) Use agg column in Having clause behave different with column type

2020-04-03 Thread angerszhu (Jira)
angerszhu created SPARK-31334:
-

 Summary: Use agg column in Having clause behave different with 
column type 
 Key: SPARK-31334
 URL: https://issues.apache.org/jira/browse/SPARK-31334
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org