[jira] [Created] (SPARK-31344) Polish implementation of barrier() and allGather()
wuyi created SPARK-31344: Summary: Polish implementation of barrier() and allGather() Key: SPARK-31344 URL: https://issues.apache.org/jira/browse/SPARK-31344 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 3.0.0 Reporter: wuyi Currently, the implementations of barrier() and allGather() contain a lot of duplicated code; we should polish them to make the code simpler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
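As a side note on the duplication mentioned in SPARK-31344, here is a hypothetical refactoring sketch (not the actual PySpark internals) of how barrier() and allGather() could share one helper, since the two calls differ mainly in the request type and payload:

{code:python}
# Illustrative only -- class and method bodies are placeholders, not Spark code.
class BarrierTaskContextSketch(object):
    def _run_barrier_request(self, request, message=""):
        # shared plumbing that the two public methods currently duplicate:
        # send the request to the barrier coordinator and wait for all tasks
        return ["%s:%s" % (request, message)]

    def barrier(self):
        self._run_barrier_request("BARRIER")

    def allGather(self, message=""):
        return self._run_barrier_request("ALL_GATHER", message)
{code}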
[jira] [Commented] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging
[ https://issues.apache.org/jira/browse/SPARK-31231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074961#comment-17074961 ] Hyukjin Kwon commented on SPARK-31231: -- This was fixed in setuptools https://github.com/pypa/setuptools/pull/2046 > Support setuptools 46.1.0+ in PySpark packaging > --- > > Key: SPARK-31231 > URL: https://issues.apache.org/jira/browse/SPARK-31231 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Blocker > > PIP packaging test started to fail (see > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/) > as of setuptools 46.1.0 release. > In https://github.com/pypa/setuptools/issues/1424, they decided to don't keep > the modes in {{package_data}}. In PySpark pip installation, we keep the > executable scripts in {{package_data}} > https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and > expose their symbolic links as executable scripts. > So, the symbolic links (or copied scripts) executes the scripts copied from > {{package_data}}, which didn't keep the modes: > {code} > /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: > /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: > Permission denied > /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: > /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: > cannot execute: Permission denied > {code} > The current issue is being tracked at > https://github.com/pypa/setuptools/issues/2041 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging
[ https://issues.apache.org/jira/browse/SPARK-31231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31231. -- Assignee: Hyukjin Kwon Resolution: Fixed > Support setuptools 46.1.0+ in PySpark packaging > --- > > Key: SPARK-31231 > URL: https://issues.apache.org/jira/browse/SPARK-31231 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Blocker > > PIP packaging test started to fail (see > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/) > as of setuptools 46.1.0 release. > In https://github.com/pypa/setuptools/issues/1424, they decided to don't keep > the modes in {{package_data}}. In PySpark pip installation, we keep the > executable scripts in {{package_data}} > https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and > expose their symbolic links as executable scripts. > So, the symbolic links (or copied scripts) executes the scripts copied from > {{package_data}}, which didn't keep the modes: > {code} > /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: > /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: > Permission denied > /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: > /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: > cannot execute: Permission denied > {code} > The current issue is being tracked at > https://github.com/pypa/setuptools/issues/2041 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
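For reference, a minimal, hypothetical sketch of the packaging pattern SPARK-31231 describes (package name, paths and script list are illustrative, not the exact contents of python/setup.py): the launcher scripts ship via package_data, so pip must preserve their file modes for the exposed entry points to remain executable.

{code:python}
# Hypothetical, simplified setup.py -- names and paths are examples only.
from setuptools import setup

setup(
    name="pyspark-packaging-sketch",
    version="0.0.1",
    packages=["pyspark"],
    package_data={
        # bin/spark-class and friends live here; setuptools 46.1.0 stopped
        # preserving file modes for package_data, which caused the
        # "Permission denied" errors quoted above (fixed again upstream in
        # setuptools PR 2046).
        "pyspark": ["bin/*"],
    },
    scripts=["pyspark/bin/spark-submit"],  # exposed as an executable wrapper
)
{code}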
[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications
[ https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074889#comment-17074889 ] L. C. Hsieh commented on SPARK-31308: - Thank you [~dongjoon] > Make Python dependencies available for Non-PySpark applications > --- > > Key: SPARK-31308 > URL: https://issues.apache.org/jira/browse/SPARK-31308 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Submit >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31343) Check codegen does not fail on expressions with special characters in string parameters
Maxim Gekk created SPARK-31343: -- Summary: Check codegen does not fail on expressions with special characters in string parameters Key: SPARK-31343 URL: https://issues.apache.org/jira/browse/SPARK-31343 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Add tests similar to tests added by the PR https://github.com/apache/spark/pull/20182 for from_utc_timestamp / to_utc_timestamp -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
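A hedged illustration of the kind of check SPARK-31343 asks for (the expression and literals are arbitrary examples, not the tests that were actually added): when a string parameter ends up embedded in generated Java source, quotes, backslashes and newlines must be escaped correctly or codegen fails to compile.

{code:python}
# Illustrative only: any expression with a string parameter would do.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.range(1).select(
    # the literal and the regex both contain characters that need escaping
    # when turned into Java source: a double quote, a backslash and a newline
    F.regexp_replace(F.lit('a"b\\c\nd'), '["\\\\\n]', "_").alias("cleaned")
)
df.explain()   # planning and code generation should succeed
df.show()      # and evaluation should print a_b_c_d rather than crashing
{code}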
[jira] [Created] (SPARK-31342) Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582
Bruce Robbins created SPARK-31342: - Summary: Fail by default if Parquet DATE or TIMESTAMP data is before October 15, 1582 Key: SPARK-31342 URL: https://issues.apache.org/jira/browse/SPARK-31342 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Bruce Robbins Some users may not know they are creating and/or reading DATE or TIMESTAMP data from before October 15, 1582 (because of data encoding libraries, etc.). Therefore, it may not be clear whether they need to toggle the two rebaseDateTime config settings. By default, Spark should fail if it reads or writes data from October 15, 1582 or before. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
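A hedged example of the scenario SPARK-31342 describes (the configuration names below are placeholders for "the two rebaseDateTime config settings" mentioned above, since the exact option names were still in flux during 3.0 development; the output path is illustrative):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder option names -- stand-ins for the two rebase settings, not
# necessarily the final spelling shipped in Spark 3.0:
# spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", "true")
# spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled", "true")

# Writing a date from before October 15, 1582 is exactly the case this ticket
# wants Spark to fail fast on unless the user explicitly chooses a calendar.
spark.range(1).select(F.to_date(F.lit("1200-01-01")).alias("dt")) \
    .write.mode("overwrite").parquet("/tmp/pre_gregorian_dates")
{code}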
[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074745#comment-17074745 ] Bruce Robbins commented on SPARK-30951: --- {quote} we can fail by default when reading datetime values before 1582 from parquet files. {quote} That sounds reasonable. I can make a PR, but if someone beats me to it, I won't complain. > Potential data loss for legacy applications after switch to proleptic > Gregorian calendar > > > Key: SPARK-30951 > URL: https://issues.apache.org/jira/browse/SPARK-30951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Assignee: Maxim Gekk >Priority: Blocker > Labels: release-notes > Fix For: 3.0.0 > > > tl;dr: We recently discovered some Spark 2.x sites that have lots of data > containing dates before October 15, 1582. This could be an issue when such > sites try to upgrade to Spark 3.0. > From SPARK-26651: > {quote}"The changes might impact on the results for dates and timestamps > before October 15, 1582 (Gregorian) > {quote} > We recently discovered that some large scale Spark 2.x applications rely on > dates before October 15, 1582. > Two cases came up recently: > * An application that uses a commercial third-party library to encode > sensitive dates. On insert, the library encodes the actual date as some other > date. On select, the library decodes the date back to the original date. The > encoded value could be any date, including one before October 15, 1582 (e.g., > "0602-04-04"). > * An application that uses a specific unlikely date (e.g., "1200-01-01") as > a marker to indicate "unknown date" (in lieu of null) > Both sites ran into problems after another component in their system was > upgraded to use the proleptic Gregorian calendar. Spark applications that > read files created by the upgraded component were interpreting encoded or > marker dates incorrectly, and vice versa. Also, their data now had a mix of > calendars (hybrid and proleptic Gregorian) with no metadata to indicate which > file used which calendar. > Both sites had enormous amounts of existing data, so re-encoding the dates > using some other scheme was not a feasible solution. > This is relevant to Spark 3: > Any Spark 2 application that uses such date-encoding schemes may run into > trouble when run on Spark 3. The application may not properly interpret the > dates previously written by Spark 2. Also, once the Spark 3 version of the > application writes data, the tables will have a mix of calendars (hybrid and > proleptic gregorian) with no metadata to indicate which file uses which > calendar. > Similarly, sites might run with mixed Spark versions, resulting in data > written by one version that cannot be interpreted by the other. And as above, > the tables will now have a mix of calendars with no way to detect which file > uses which calendar. > As with the two real-life example cases, these applications may have enormous > amounts of legacy data, so re-encoding the dates using some other scheme may > not be feasible. > We might want to consider a configuration setting to allow the user to > specify the calendar for storing and retrieving date and timestamp values > (not sure how such a flag would affect other date and timestamp-related > functions). I realize the change is far bigger than just adding a > configuration setting. > Here's a quick example of where trouble may happen, using the real-life case > of the marker date. 
> In Spark 2.4: > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 1 > scala> > {noformat} > In Spark 3.0 (reading from the same legacy file): > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 0 > scala> > {noformat} > By the way, Hive had a similar problem. Hive switched from hybrid calendar to > proleptic Gregorian calendar between 2.x and 3.x. After some upgrade > headaches related to dates before 1582, the Hive community made the following > changes: > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > checks a configuration setting to determine which calendar to use. > * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive > stores the calendar type in the metadata. > * When reading date or timestamp data from ORC, Parquet, and Avro files, > Hive checks the metadata for the calendar type. > * When reading date or timestamp data from ORC, Parquet, and Avro files that > lack calendar metadata, Hive's behavior is determined by a configuration > setting. This allows Hive to read legacy data
[jira] [Created] (SPARK-31341) Spark documentation incorrectly claims 3.8 compatibility
Daniel King created SPARK-31341: --- Summary: Spark documentation incorrectly claims 3.8 compatibility Key: SPARK-31341 URL: https://issues.apache.org/jira/browse/SPARK-31341 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.5 Reporter: Daniel King The Spark documentation ([https://spark.apache.org/docs/latest/]) has this text: {quote}Spark runs on Java 8, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.4.5 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x). {quote} This suggests that Spark is compatible with Python 3.8, which is not true. For example, in the latest ubuntu:18.04 Docker image: {code:python} apt-get update apt-get install python3.8 python3-pip pip3 install pyspark python3.8 -m pip install pyspark python3.8 -c 'import pyspark' {code} Outputs: {code:python} Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.8/dist-packages/pyspark/__init__.py", line 51, in <module> from pyspark.context import SparkContext File "/usr/local/lib/python3.8/dist-packages/pyspark/context.py", line 31, in <module> from pyspark import accumulators File "/usr/local/lib/python3.8/dist-packages/pyspark/accumulators.py", line 97, in <module> from pyspark.serializers import read_int, PickleSerializer File "/usr/local/lib/python3.8/dist-packages/pyspark/serializers.py", line 72, in <module> from pyspark import cloudpickle File "/usr/local/lib/python3.8/dist-packages/pyspark/cloudpickle.py", line 145, in <module> _cell_set_template_code = _make_cell_set_template_code() File "/usr/local/lib/python3.8/dist-packages/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code return types.CodeType( TypeError: an integer is required (got type bytes) {code} I propose the documentation be updated to say "Python 3.4 to 3.7". I also propose that the `setup.py` file for pyspark include: {code:python} python_requires=">=3.6,<3.8", {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074689#comment-17074689 ] Dongjoon Hyun commented on SPARK-25102: --- Are we going to have 2.4.7 or 2.4.8? For now, 2.4.6 is the last planned release. Could you send an email to dev mailing list about your LTS plan at 2.4.x first? cc [~dbtsai] > Write Spark version to ORC/Parquet file metadata > > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zoltan Ivanfi >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Currently, Spark writes Spark version number into Hive Table properties with > `spark.sql.create.version`. > {code} > parameters:{ > spark.sql.sources.schema.part.0={ > "type":"struct", > "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}] > }, > transient_lastDdlTime=1541142761, > spark.sql.sources.schema.numParts=1, > spark.sql.create.version=2.4.0 > } > {code} > This issue aims to write Spark versions to ORC/Parquet file metadata with > `org.apache.spark.sql.create.version`. It's different from Hive Table > property key `spark.sql.create.version`. It seems that we cannot change that > for backward compatibility (even in Apache Spark 3.0) > *ORC* > {code} > User Metadata: > org.apache.spark.sql.create.version=3.0.0-SNAPSHOT > {code} > *PARQUET* > {code} > file: > file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
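As a side note, the extra key/value metadata shown in the quoted description can be inspected without Spark, for example with pyarrow (a third-party tool used here purely as an illustration; the file path is made up):

{code:python}
import pyarrow.parquet as pq

# FileMetaData.metadata exposes the footer's key/value pairs, including the
# org.apache.spark.sql.create.version entry shown in the ticket description.
meta = pq.ParquetFile("/tmp/p/example.snappy.parquet").metadata.metadata  # path is made up
print(meta.get(b"org.apache.spark.sql.create.version"))  # e.g. b'3.0.0-SNAPSHOT'
{code}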
[jira] [Reopened] (SPARK-31276) Contrived working example that works with multiple URI file storages for Spark cluster mode
[ https://issues.apache.org/jira/browse/SPARK-31276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Huang reopened SPARK-31276: --- Clarify the question with more succinct scenario that describe the the challenge of URI not being able to differentiate between driver vs executor. > Contrived working example that works with multiple URI file storages for > Spark cluster mode > --- > > Key: SPARK-31276 > URL: https://issues.apache.org/jira/browse/SPARK-31276 > Project: Spark > Issue Type: Wish > Components: Examples >Affects Versions: 2.4.5 >Reporter: Jim Huang >Priority: Major > > This Spark SQL Guide --> Data sources --> Generic Load/Save Functions > [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] > described a very simple "local file system load of an example file". > > I am looking for an example that demonstrates a workflow that exercises > different file systems. For example, > # Driver loads an input file from local file system > # Add a simple column using lit() and stores that DataFrame in cluster mode > to HDFS > # Write that a small limited subset of that DataFrame back to Driver's local > file system. (This is to avoid the anti-pattern of writing large file and > out of the scope for this example. The small limited DataFrame would be some > basic statistics, not the actual complete dataset.) > > The examples I found on the internet only uses simple paths without the > explicit URI prefixes. > Without the explicit URI prefixes, the "filepath" inherits how Spark (mode) > was called, local stand alone vs YARN client mode. So a "filepath" will be > read/write locally (file system) vs cluster mode HDFS, without these explicit > URIs. > There are situations were a Spark program needs to deal with both local file > system and YARN client mode (big data) in the same Spark application, like > producing a summary table stored on the local file system of the driver at > the end. > If there are any existing alternatives Spark documentation that provides > examples that traverse through the different URIs in Spark YARN client mode > or a better or smarter Spark pattern or API that is more suited for this, I > am happy to accept that as well. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31276) Contrived working example that works with multiple URI file storages for Spark cluster mode
[ https://issues.apache.org/jira/browse/SPARK-31276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074677#comment-17074677 ] Jim Huang edited comment on SPARK-31276 at 4/3/20, 3:51 PM: Clarified the question with more succinct scenario that describe the the challenge of URI not being able to differentiate between driver vs executor. was (Author: jimhuang): Clarify the question with more succinct scenario that describe the the challenge of URI not being able to differentiate between driver vs executor. > Contrived working example that works with multiple URI file storages for > Spark cluster mode > --- > > Key: SPARK-31276 > URL: https://issues.apache.org/jira/browse/SPARK-31276 > Project: Spark > Issue Type: Wish > Components: Examples >Affects Versions: 2.4.5 >Reporter: Jim Huang >Priority: Major > > This Spark SQL Guide --> Data sources --> Generic Load/Save Functions > [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] > described a very simple "local file system load of an example file". > > I am looking for an example that demonstrates a workflow that exercises > different file systems. For example, > # Driver loads an input file from local file system > # Add a simple column using lit() and stores that DataFrame in cluster mode > to HDFS > # Write that a small limited subset of that DataFrame back to Driver's local > file system. (This is to avoid the anti-pattern of writing large file and > out of the scope for this example. The small limited DataFrame would be some > basic statistics, not the actual complete dataset.) > > The examples I found on the internet only uses simple paths without the > explicit URI prefixes. > Without the explicit URI prefixes, the "filepath" inherits how Spark (mode) > was called, local stand alone vs YARN client mode. So a "filepath" will be > read/write locally (file system) vs cluster mode HDFS, without these explicit > URIs. > There are situations were a Spark program needs to deal with both local file > system and YARN client mode (big data) in the same Spark application, like > producing a summary table stored on the local file system of the driver at > the end. > If there are any existing alternatives Spark documentation that provides > examples that traverse through the different URIs in Spark YARN client mode > or a better or smarter Spark pattern or API that is more suited for this, I > am happy to accept that as well. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31276) Contrived working example that works with multiple URI file storages for Spark cluster mode
[ https://issues.apache.org/jira/browse/SPARK-31276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074676#comment-17074676 ] Jim Huang commented on SPARK-31276: --- The fault is mine for not framing the scenario more succinctly. In the scenario where the Spark process is launched / deployed in `client` mode on a YARN cluster: # What is the opposite of "{{sc.parallelize(data)}}"? ## Using a "file:///" URI writes the file "locally" on the executor nodes' file systems, not the driver's local FS. The challenge I have is that I am unable to explicitly specify a "file:///" URI that differentiates between the driver and the executors. I am aware this is a possible corner case that can be hit unintentionally when the dataset is too big and overflows the driver memory. But it is also a valid use case if a user just wants to store some summary statistics on the driver's local file system after some big-data processing. > Contrived working example that works with multiple URI file storages for > Spark cluster mode > --- > > Key: SPARK-31276 > URL: https://issues.apache.org/jira/browse/SPARK-31276 > Project: Spark > Issue Type: Wish > Components: Examples >Affects Versions: 2.4.5 >Reporter: Jim Huang >Priority: Major > > This Spark SQL Guide --> Data sources --> Generic Load/Save Functions > [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] > described a very simple "local file system load of an example file". > > I am looking for an example that demonstrates a workflow that exercises > different file systems. For example, > # Driver loads an input file from local file system > # Add a simple column using lit() and stores that DataFrame in cluster mode > to HDFS > # Write that a small limited subset of that DataFrame back to Driver's local > file system. (This is to avoid the anti-pattern of writing large file and > out of the scope for this example. The small limited DataFrame would be some > basic statistics, not the actual complete dataset.) > > The examples I found on the internet only uses simple paths without the > explicit URI prefixes. > Without the explicit URI prefixes, the "filepath" inherits how Spark (mode) > was called, local stand alone vs YARN client mode. So a "filepath" will be > read/write locally (file system) vs cluster mode HDFS, without these explicit > URIs. > There are situations were a Spark program needs to deal with both local file > system and YARN client mode (big data) in the same Spark application, like > producing a summary table stored on the local file system of the driver at > the end. > If there are any existing alternatives Spark documentation that provides > examples that traverse through the different URIs in Spark YARN client mode > or a better or smarter Spark pattern or API that is more suited for this, I > am happy to accept that as well. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
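One hedged way to get the behaviour asked for in the comment above under YARN client mode (the input path and aggregation are only examples): compute the small summary distributed, then bring it to the driver and write it with plain Python file APIs, which always target the driver's local file system and so sidestep the file:/// ambiguity.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

big = spark.read.parquet("hdfs:///data/big_dataset")          # illustrative input
summary = big.agg(F.count("*").alias("rows"),
                  F.max("some_column").alias("max_value"))    # keep the result tiny

# toPandas() collects the (small) result to the driver, so the write below is
# guaranteed to land on the driver's local file system, not on the executors.
summary.toPandas().to_csv("/home/me/summary.csv", index=False)
{code}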
[jira] [Updated] (SPARK-31340) No call to destroy() for filter in SparkHistory
[ https://issues.apache.org/jira/browse/SPARK-31340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] thierry accart updated SPARK-31340: --- Summary: No call to destroy() for filter in SparkHistory (was: No call to destroy() for authentication filter in SparkHistory) > No call to destroy() for filter in SparkHistory > --- > > Key: SPARK-31340 > URL: https://issues.apache.org/jira/browse/SPARK-31340 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: thierry accart >Priority: Major > > Adding UI filter AuthenticationFilter (from Hadoop) causes Spark application > to never end, due to threads created in this class not interrupted. > *To reproduce* > Start a local spark context with hadoop-auth 3.1.0 > {{spark.ui.enabled=true}} > {{spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter}} > {{#and all required ldap props}} > {{ > spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.param.ldap.*=...}} > *What's happening :* > In [AuthenticationFilter's > |https://github.com/apache/hadoop/blob/branch-3.1/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/server/AuthenticationFilter.java] > init we have the following chain: > {{(line.178) initializeSecretProvider(filterConfig);}} > {{(}}{{line.209) secretProvider = constructSecretProvider(...)}} > {{(}}{{line 237) provider.init(config, ctx, validity);}} > If no config is specified provider will be [RolloverSignerSecretProvider > |https://github.com/apache/hadoop/blob/branch-3.1/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/util/RolloverSignerSecretProvider.java]which > will (line 95) start a new thread via > {{scheduler = Executors.newSingleThreadScheduledExecutor();}} > The created thread will be stopped in destroy() method (line 106). > *Unfortunately, this destroy() method is not called* when SparkHistory is > closed, leaving threads running. > > This ticket is not here to address the particular case of Hadoop's > authentication filter, but to ensure that any Filter added in spark.ui will > have its destroy() method called. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31340) No call to destroy() for authentication filter in SparkHistory
thierry accart created SPARK-31340: -- Summary: No call to destroy() for authentication filter in SparkHistory Key: SPARK-31340 URL: https://issues.apache.org/jira/browse/SPARK-31340 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.5 Reporter: thierry accart Adding the UI filter AuthenticationFilter (from Hadoop) causes the Spark application to never end, because threads created in this class are never interrupted. *To reproduce* Start a local Spark context with hadoop-auth 3.1.0: {{spark.ui.enabled=true}} {{spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter}} {{#and all required ldap props}} {{ spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.param.ldap.*=...}} *What's happening:* In [AuthenticationFilter's|https://github.com/apache/hadoop/blob/branch-3.1/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/server/AuthenticationFilter.java] init we have the following call chain: {{(line 178) initializeSecretProvider(filterConfig);}} {{(line 209) secretProvider = constructSecretProvider(...)}} {{(line 237) provider.init(config, ctx, validity);}} If no config is specified, the provider will be [RolloverSignerSecretProvider|https://github.com/apache/hadoop/blob/branch-3.1/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/util/RolloverSignerSecretProvider.java], which will (line 95) start a new thread via {{scheduler = Executors.newSingleThreadScheduledExecutor();}} The created thread is only stopped in the destroy() method (line 106). *Unfortunately, this destroy() method is not called* when SparkHistory is closed, leaving threads running. This ticket is not here to address the particular case of Hadoop's authentication filter, but to ensure that any Filter added in spark.ui will have its destroy() method called. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31327) write spark version to avro file metadata
[ https://issues.apache.org/jira/browse/SPARK-31327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31327. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28102 [https://github.com/apache/spark/pull/28102] > write spark version to avro file metadata > - > > Key: SPARK-31327 > URL: https://issues.apache.org/jira/browse/SPARK-31327 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30840) Add version property for ConfigEntry and ConfigBuilder
[ https://issues.apache.org/jira/browse/SPARK-30840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30840: - Fix Version/s: 3.0.0 > Add version property for ConfigEntry and ConfigBuilder > -- > > Key: SPARK-30840 > URL: https://issues.apache.org/jira/browse/SPARK-30840 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.0.0, 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()
[ https://issues.apache.org/jira/browse/SPARK-31339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj updated SPARK-31339: -- Description: PR: [https://github.com/apache/spark/pull/28110] * What changes were proposed in this pull request? pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) * Why are the changes needed? This change fixes the loading of class (which inherits from PipelineModel class) from file. E.g. Current issue: {code:java} CustomPipelineModel(PipelineModel): def _transform(self, df): ... CustomPipelineModel.save('path/to/file') # works CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() instead of CustomPipelineModel() CustomPipelineModel.transform() # wrong: results in calling PipelineModel.transform() instead of CustomPipelineModel.transform(){code} * Does this introduce any user-facing change? No. was: PR: [https://github.com/apache/spark/pull/28110] What changes were proposed in this pull request? pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) Why are the changes needed? This change fixes the loading of class (which inherits from PipelineModel class) from file. E.g. Current issue: {code:java} CustomPipelineModel(PipelineModel): def _transform(self, df): ... CustomPipelineModel.save('path/to/file') # works CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() instead of CustomPipelineModel() CustomPipelineModel.transform() # wrong: results in calling PipelineModel.transform() instead of CustomPipelineModel.transform(){code} Does this introduce any user-facing change? No. > Changed PipelineModel(...) to self.cls(...) in > pyspark.ml.pipeline.PipelineModelReader.load() > - > > Key: SPARK-31339 > URL: https://issues.apache.org/jira/browse/SPARK-31339 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Suraj >Priority: Minor > Labels: pull-request-available > Original Estimate: 0h > Remaining Estimate: 0h > > PR: [https://github.com/apache/spark/pull/28110] > * What changes were proposed in this pull request? > pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) > * Why are the changes needed? > This change fixes the loading of class (which inherits from PipelineModel > class) from file. > E.g. Current issue: > {code:java} > CustomPipelineModel(PipelineModel): > def _transform(self, df): > ... > CustomPipelineModel.save('path/to/file') # works > CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() > instead of CustomPipelineModel() > CustomPipelineModel.transform() # wrong: results in calling > PipelineModel.transform() instead of CustomPipelineModel.transform(){code} > * Does this introduce any user-facing change? > No. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()
[ https://issues.apache.org/jira/browse/SPARK-31339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj updated SPARK-31339: -- Description: PR: [https://github.com/apache/spark/pull/28110] What changes were proposed in this pull request? pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) Why are the changes needed? This change fixes the loading of class (which inherits from PipelineModel class) from file. E.g. Current issue: {code:java} CustomPipelineModel(PipelineModel): def _transform(self, df): ... CustomPipelineModel.save('path/to/file') # works CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() instead of CustomPipelineModel() CustomPipelineModel.transform() # wrong: results in calling PipelineModel.transform() instead of CustomPipelineModel.transform(){code} Does this introduce any user-facing change? No. was: PR: [https://github.com/apache/spark/pull/28110] What changes were proposed in this pull request? pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) Why are the changes needed? This change fixes the loading of class (which inherits from PipelineModel class) from file. E.g. Current issue: ``` CustomPipelineModel(PipelineModel): def _transform(self, df): ... CustomPipelineModel.save('path/to/file') # works CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() instead of CustomPipelineModel() CustomPipelineModel.transform() # wrong: results in calling PipelineModel.transform() instead of CustomPipelineModel.transform() ``` Does this introduce any user-facing change? No. > Changed PipelineModel(...) to self.cls(...) in > pyspark.ml.pipeline.PipelineModelReader.load() > - > > Key: SPARK-31339 > URL: https://issues.apache.org/jira/browse/SPARK-31339 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Suraj >Priority: Minor > Labels: pull-request-available > Original Estimate: 0h > Remaining Estimate: 0h > > PR: [https://github.com/apache/spark/pull/28110] > What changes were proposed in this pull request? > pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) > Why are the changes needed? > This change fixes the loading of class (which inherits from PipelineModel > class) from file. > E.g. Current issue: > {code:java} > CustomPipelineModel(PipelineModel): > def _transform(self, df): > ... > CustomPipelineModel.save('path/to/file') # works > CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() > instead of CustomPipelineModel() > CustomPipelineModel.transform() # wrong: results in calling > PipelineModel.transform() instead of CustomPipelineModel.transform(){code} > Does this introduce any user-facing change? > No. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()
[ https://issues.apache.org/jira/browse/SPARK-31339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj updated SPARK-31339: -- Description: PR: [https://github.com/apache/spark/pull/28110] What changes were proposed in this pull request? pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) Why are the changes needed? This change fixes the loading of class (which inherits from PipelineModel class) from file. E.g. Current issue: ``` CustomPipelineModel(PipelineModel): def _transform(self, df): ... CustomPipelineModel.save('path/to/file') # works CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() instead of CustomPipelineModel() CustomPipelineModel.transform() # wrong: results in calling PipelineModel.transform() instead of CustomPipelineModel.transform() ``` Does this introduce any user-facing change? No. was: PR: [https://github.com/apache/spark/pull/28110] ### What changes were proposed in this pull request? pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) ### Why are the changes needed? This change fixes the loading of class (which inherits from PipelineModel class) from file. E.g. Current issue: ``` CustomPipelineModel(PipelineModel): def _transform(self, df): ... CustomPipelineModel.save('path/to/file') # works CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() instead of CustomPipelineModel() CustomPipelineModel.transform() # wrong: results in calling PipelineModel.transform() instead of CustomPipelineModel.transform() ``` ### Does this introduce any user-facing change? No. > Changed PipelineModel(...) to self.cls(...) in > pyspark.ml.pipeline.PipelineModelReader.load() > - > > Key: SPARK-31339 > URL: https://issues.apache.org/jira/browse/SPARK-31339 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.5 >Reporter: Suraj >Priority: Minor > Labels: pull-request-available > Original Estimate: 0h > Remaining Estimate: 0h > > PR: [https://github.com/apache/spark/pull/28110] > What changes were proposed in this pull request? > pypsark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) > Why are the changes needed? > This change fixes the loading of class (which inherits from PipelineModel > class) from file. > E.g. Current issue: > ``` > CustomPipelineModel(PipelineModel): > def _transform(self, df): > ... > CustomPipelineModel.save('path/to/file') # works > CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() > instead of CustomPipelineModel() > CustomPipelineModel.transform() # wrong: results in calling > PipelineModel.transform() instead of CustomPipelineModel.transform() > ``` > Does this introduce any user-facing change? > No. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31339) Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load()
Suraj created SPARK-31339: - Summary: Changed PipelineModel(...) to self.cls(...) in pyspark.ml.pipeline.PipelineModelReader.load() Key: SPARK-31339 URL: https://issues.apache.org/jira/browse/SPARK-31339 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.4.5 Reporter: Suraj PR: [https://github.com/apache/spark/pull/28110] ### What changes were proposed in this pull request? pyspark.ml.pipeline.py line 245: Change PipelineModel(...) to self.cls(...) ### Why are the changes needed? This change fixes the loading of a class (which inherits from the PipelineModel class) from file. E.g. Current issue: ``` class CustomPipelineModel(PipelineModel): def _transform(self, df): ... CustomPipelineModel.save('path/to/file') # works CustomPipelineModel.load('path/to/file') # wrong: results in PipelineModel() instead of CustomPipelineModel() CustomPipelineModel.transform() # wrong: results in calling PipelineModel.transform() instead of CustomPipelineModel.transform() ``` ### Does this introduce any user-facing change? No. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
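A hedged, simplified sketch of the change SPARK-31339 proposes (not a verbatim copy of pyspark/ml/pipeline.py; the Java-backed code path and error handling are omitted): the reader instantiates the class it was constructed for instead of hard-coding PipelineModel, so subclasses survive a save()/load() round trip.

{code:python}
from pyspark.ml.pipeline import PipelineSharedReadWrite
from pyspark.ml.util import DefaultParamsReader, MLReader

class PipelineModelReaderSketch(MLReader):
    def __init__(self, cls):
        super(PipelineModelReaderSketch, self).__init__()
        self.cls = cls                      # e.g. CustomPipelineModel

    def load(self, path):
        metadata = DefaultParamsReader.loadMetadata(path, self.sc)
        uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)
        # before the fix this line was PipelineModel(stages=stages)._resetUid(uid),
        # which silently discarded the subclass
        return self.cls(stages=stages)._resetUid(uid)
{code}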
[jira] [Updated] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key.
[ https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Dave updated SPARK-31338: --- Description: h2. *Our Use-case Details:* While reading from a jdbc source using spark sql, we are using below read format : jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties). *Table defination :* postgres=> \d lineitem_sf1000 Table "public.lineitem_sf1000" Column | Type | Modifiers -++-- *l_orderkey | bigint | not null* l_partkey | bigint | not null l_suppkey | bigint | not null l_linenumber | bigint | not null l_quantity | numeric(10,2) | not null l_extendedprice | numeric(10,2) | not null l_discount | numeric(10,2) | not null l_tax | numeric(10,2) | not null l_returnflag | character varying(1) | not null l_linestatus | character varying(1) | not null l_shipdate | character varying(29) | not null l_commitdate | character varying(29) | not null l_receiptdate | character varying(29) | not null l_shipinstruct | character varying(25) | not null l_shipmode | character varying(10) | not null l_comment | character varying(44) | not null Indexes: "l_order_sf1000_idx" btree (l_orderkey) *Partition column* : l_orderkey *numpartion* : 16 h2. *Problem details :* {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND l_orderkey < 187501 {code} 15 queries are generated with the above BETWEEN clauses. The last query looks like this below: {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or l_orderkey is null {code} I*n the last query, we are trying to get the remaining records, along with any data in the table for the partition key having NULL values.* This hurts performance badly. While the first 15 SQLs took approximately 10 minutes to execute, the last SQL with the NULL check takes 45 minutes because it has to evaluate a second scan(OR clause) of the table for NULL values for the partition key. *Note that I have defined the partition key of the table to be NOT NULL, at the database. Therefore, the SQL for the last partition need not have this NULL check, Spark SQl should be able to avoid such condition and this Jira is intended to fix this behavior.* {code:java} {code} was: *Our Use-case Details:* While reading from a jdbc source using spark sql, we are using below read format : jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties). 
Table defination : postgres=> \d lineitem_sf1000 Table "public.lineitem_sf1000" Column | Type | Modifiers -+---+--- l_orderkey | bigint | not null l_partkey | bigint | not null l_suppkey | bigint | not null l_linenumber | bigint | not null l_quantity | numeric(10,2) | not null l_extendedprice | numeric(10,2) | not null l_discount | numeric(10,2) | not null l_tax | numeric(10,2) | not null l_returnflag | character varying(1) | not null l_linestatus | character varying(1) | not null l_shipdate | character varying(29) | not null l_commitdate | character varying(29) | not null l_receiptdate | character varying(29) | not null l_shipinstruct | character varying(25) | not null l_shipmode | character varying(10) | not null l_comment | character varying(44) | not null Indexes: "l_order_sf1000_idx" btree (l_orderkey) Partition column : l_orderkey numpartion : 16 *Problem details :* {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment
[jira] [Updated] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key.
[ https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Dave updated SPARK-31338: --- Description: *Our Use-case Details:* While reading from a jdbc source using spark sql, we are using below read format : jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties). Table defination : postgres=> \d lineitem_sf1000 Table "public.lineitem_sf1000" Column | Type | Modifiers -+---+--- l_orderkey | bigint | not null l_partkey | bigint | not null l_suppkey | bigint | not null l_linenumber | bigint | not null l_quantity | numeric(10,2) | not null l_extendedprice | numeric(10,2) | not null l_discount | numeric(10,2) | not null l_tax | numeric(10,2) | not null l_returnflag | character varying(1) | not null l_linestatus | character varying(1) | not null l_shipdate | character varying(29) | not null l_commitdate | character varying(29) | not null l_receiptdate | character varying(29) | not null l_shipinstruct | character varying(25) | not null l_shipmode | character varying(10) | not null l_comment | character varying(44) | not null Indexes: "l_order_sf1000_idx" btree (l_orderkey) Partition column : l_orderkey numpartion : 16 *Problem details :* {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND l_orderkey < 187501 {code} 15 queries are generated with the above BETWEEN clauses. The last query looks like this below: {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or l_orderkey is null {code} I*n the last query, we are trying to get the remaining records, along with any data in the table for the partition key having NULL values.* This hurts performance badly. While the first 15 SQLs took approximately 10 minutes to execute, the last SQL with the NULL check takes 45 minutes because it has to evaluate a second scan(OR clause) of the table for NULL values for the partition key. *Note that I have defined the partition key of the table to be NOT NULL, at the database. Therefore, the SQL for the last partition need not have this NULL check, Spark SQl should be able to avoid such condition and this Jira is intended to fix this behavior.* {code:java} {code} was: *Our Use-case Details:* While reading from a jdbc source using spark sql, we are using below read format : jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties). 
Table defination : postgres=> \d lineitem_sf1000 Table "public.lineitem_sf1000" Column | Type | Modifiers -+---+--- l_orderkey | bigint| not null l_partkey | bigint | not null l_suppkey | bigint| not null l_linenumber| bigint| not null l_quantity | numeric(10,2) | not null l_extendedprice | numeric(10,2) | not null l_discount | numeric(10,2) | not null l_tax | numeric(10,2) | not null l_returnflag| character varying(1) | not null l_linestatus| character varying(1) | not null l_shipdate | character varying(29) | not null l_commitdate| character varying(29) | not null l_receiptdate | character varying(29) | not null l_shipinstruct | character varying(25) | not null l_shipmode | character varying(10) | not null l_comment | character varying(44) | not nullIndexes: "l_order_sf1000_idx" btree (l_orderkey) Partition column : l_orderkey numpartion : 16 *Problem details :* {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT
[jira] [Updated] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key.
[ https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Dave updated SPARK-31338: --- Description: *Our Use-case Details:* While reading from a jdbc source using spark sql, we are using below read format : jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties). Table defination : postgres=> \d lineitem_sf1000 Table "public.lineitem_sf1000" Column | Type | Modifiers -+---+--- l_orderkey | bigint| not null l_partkey | bigint | not null l_suppkey | bigint| not null l_linenumber| bigint| not null l_quantity | numeric(10,2) | not null l_extendedprice | numeric(10,2) | not null l_discount | numeric(10,2) | not null l_tax | numeric(10,2) | not null l_returnflag| character varying(1) | not null l_linestatus| character varying(1) | not null l_shipdate | character varying(29) | not null l_commitdate| character varying(29) | not null l_receiptdate | character varying(29) | not null l_shipinstruct | character varying(25) | not null l_shipmode | character varying(10) | not null l_comment | character varying(44) | not nullIndexes: "l_order_sf1000_idx" btree (l_orderkey) Partition column : l_orderkey numpartion : 16 *Problem details :* {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND l_orderkey < 187501 {code} 15 queries are generated with the above BETWEEN clauses. The last query looks like this below: {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or l_orderkey is null {code} I*n the last query, we are trying to get the remaining records, along with any data in the table for the partition key having NULL values.* This hurts performance badly. While the first 15 SQLs took approximately 10 minutes to execute, the last SQL with the NULL check takes 45 minutes because it has to evaluate a second scan(OR clause) of the table for NULL values for the partition key. *Note that I have defined the partition key of the table to be NOT NULL, at the database. Therefore, the SQL for the last partition need not have this NULL check, Spark SQl should be able to avoid such condition and this Jira is intended to fix this behavior.* {code:java} {code} was: *Our Use-case Details:* While reading from a jdbc source using spark sql, we are using below read format : jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties). 
Table defination : postgres=> \d lineitem_sf1000 Table "public.lineitem_sf1000" Column | Type | Modifiers -+---+--- l_orderkey | bigint| not *null* l_partkey | bigint | not null l_suppkey | bigint| not null l_linenumber| bigint| not null l_quantity | numeric(10,2) | not null l_extendedprice | numeric(10,2) | not null l_discount | numeric(10,2) | not null l_tax | numeric(10,2) | not null l_returnflag| character varying(1) | not null l_linestatus| character varying(1) | not null l_shipdate | character varying(29) | not null l_commitdate| character varying(29) | not null l_receiptdate | character varying(29) | not null l_shipinstruct | character varying(25) | not null l_shipmode | character varying(10) | not null l_comment | character varying(44) | not nullIndexes: "l_order_sf1000_idx" btree (l_orderkey) Partition column : l_orderkey numpartion : 16 *Problem details :* {code:java} SELECT
[jira] [Created] (SPARK-31338) Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key.
Mohit Dave created SPARK-31338: -- Summary: Spark SQL JDBC Data Source partitioned read : Spark SQL does not honor for NOT NULL table definition of partition key. Key: SPARK-31338 URL: https://issues.apache.org/jira/browse/SPARK-31338 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.5 Reporter: Mohit Dave *Our Use-case Details:* While reading from a JDBC source using Spark SQL, we are using the read format below: jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties). Table definition: postgres=> \d lineitem_sf1000 Table "public.lineitem_sf1000" Column | Type | Modifiers -+---+--- l_orderkey | bigint | not *null* l_partkey | bigint | not null l_suppkey | bigint | not null l_linenumber | bigint | not null l_quantity | numeric(10,2) | not null l_extendedprice | numeric(10,2) | not null l_discount | numeric(10,2) | not null l_tax | numeric(10,2) | not null l_returnflag | character varying(1) | not null l_linestatus | character varying(1) | not null l_shipdate | character varying(29) | not null l_commitdate | character varying(29) | not null l_receiptdate | character varying(29) | not null l_shipinstruct | character varying(25) | not null l_shipmode | character varying(10) | not null l_comment | character varying(44) | not null Indexes: "l_order_sf1000_idx" btree (l_orderkey) Partition column: l_orderkey numPartitions: 16 *Problem details:* {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 150001 AND l_orderkey < 187501 {code} Fifteen queries are generated with range predicates like the one above. The last query looks like this: {code:java} SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 37501 or l_orderkey is null {code} *In the last query, we are trying to get the remaining records, along with any data in the table where the partition key has NULL values.* This hurts performance badly. While the first 15 SQLs took approximately 10 minutes to execute, the last SQL with the NULL check takes 45 minutes because it has to evaluate a second scan (OR clause) of the table for NULL values for the partition key. *Note that I have defined the partition key of the table to be NOT NULL at the database level. Therefore, the SQL for the last partition need not have this NULL check; Spark SQL should be able to avoid such a condition, and this Jira is intended to fix that behavior.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
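For context, a hedged PySpark equivalent of the partitioned read SPARK-31338 describes (URL, credentials and bounds are illustrative): Spark splits the [lowerBound, upperBound) range on the partition column into numPartitions slices, and one of the generated predicates also carries an "l_orderkey IS NULL" branch, which is the extra scan the reporter wants suppressed when the column is declared NOT NULL.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lineitem = spark.read.jdbc(
    url="jdbc:postgresql://dbhost:5432/postgres",   # illustrative connection
    table="public.lineitem_sf1000",
    column="l_orderkey",
    lowerBound=1,                                   # illustrative bounds
    upperBound=600000001,
    numPartitions=16,
    properties={"user": "postgres", "password": "secret"},  # placeholders
)
# Each of the 16 partitions issues one of the WHERE clauses shown above.
print(lineitem.rdd.getNumPartitions())
{code}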
[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074434#comment-17074434 ] Wenchen Fan commented on SPARK-25102: - I'd like to propose backporting this to 2.4. It's very important to have version info in the file metadata in order to implement backward compatibility. It's unfortunate that we started this so late, but it still helps if Spark 2.4.6 starts to do it, as we will maintain the 2.4 line for a long time. Any thoughts? > Write Spark version to ORC/Parquet file metadata > > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zoltan Ivanfi >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Currently, Spark writes Spark version number into Hive Table properties with > `spark.sql.create.version`. > {code} > parameters:{ > spark.sql.sources.schema.part.0={ > "type":"struct", > "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}] > }, > transient_lastDdlTime=1541142761, > spark.sql.sources.schema.numParts=1, > spark.sql.create.version=2.4.0 > } > {code} > This issue aims to write Spark versions to ORC/Parquet file metadata with > `org.apache.spark.sql.create.version`. It's different from Hive Table > property key `spark.sql.create.version`. It seems that we cannot change that > for backward compatibility (even in Apache Spark 3.0) > *ORC* > {code} > User Metadata: > org.apache.spark.sql.create.version=3.0.0-SNAPSHOT > {code} > *PARQUET* > {code} > file: > file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
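Following up on the backport discussion above: the value of the metadata key is that readers can branch on the writer version. Below is a hedged Scala sketch, assuming the org.apache.orc core Reader API (OrcFile.createReader, Reader.hasMetadataValue, Reader.getMetadataValue); the helper name and file path are made up for illustration and are not part of the issue.
{code:java}
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

object ReadSparkVersionFromOrc {
  private val VersionKey = "org.apache.spark.sql.create.version"

  // Returns the Spark version recorded in the ORC user metadata, if present.
  def sparkWriterVersion(file: String): Option[String] = {
    val reader = OrcFile.createReader(new Path(file), OrcFile.readerOptions(new Configuration()))
    if (reader.hasMetadataValue(VersionKey)) {
      // User metadata values are raw bytes; decode as UTF-8.
      Some(StandardCharsets.UTF_8.decode(reader.getMetadataValue(VersionKey)).toString)
    } else {
      None // e.g. files written before this feature landed (or before a 2.4.x backport)
    }
  }

  def main(args: Array[String]): Unit = {
    // A reader that needs backward-compatible behavior could branch on this value.
    println(sparkWriterVersion("/tmp/p/part-00000.orc").getOrElse("unknown writer"))
  }
}
{code}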
[jira] [Commented] (SPARK-31272) Support DB2 Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074421#comment-17074421 ] Gabor Somogyi commented on SPARK-31272: --- The code is ready, but it depends on the MariaDB PR. I intend to file a PR once the MariaDB one is ready... > Support DB2 Kerberos login in JDBC connector > > > Key: SPARK-31272 > URL: https://issues.apache.org/jira/browse/SPARK-31272 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31336) Support Oracle Kerberos login in JDBC connector
Gabor Somogyi created SPARK-31336: - Summary: Support Oracle Kerberos login in JDBC connector Key: SPARK-31336 URL: https://issues.apache.org/jira/browse/SPARK-31336 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Gabor Somogyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31337) Support MS Sql Kerberos login in JDBC connector
Gabor Somogyi created SPARK-31337: - Summary: Support MS Sql Kerberos login in JDBC connector Key: SPARK-31337 URL: https://issues.apache.org/jira/browse/SPARK-31337 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Gabor Somogyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch
[ https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074390#comment-17074390 ] Ismaël Mejía commented on SPARK-31330: -- Ah, thanks for letting me know about the mail; I did not reply-all. I think Hyukjin already forwarded it, so we should be good. Don't hesitate to ping me in the PR or in the INFRA ticket if you need some ref/help. > Automatically label PRs based on the paths they touch > - > > Key: SPARK-31330 > URL: https://issues.apache.org/jira/browse/SPARK-31330 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > We can potentially leverage the added labels to drive testing, review, or > other project tooling. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-31334) Use agg column in Having clause behave different with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31334: -- Comment: was deleted (was: cc [~cloud_fan] [~yumwang] ) > Use agg column in Having clause behave different with column type > -- > > Key: SPARK-31334 > URL: https://issues.apache.org/jira/browse/SPARK-31334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > {code:java} > ``` > test("") { > Seq( > (1, 3), > (2, 3), > (3, 6), > (4, 7), > (5, 9), > (6, 9) > ).toDF("a", "b").createOrReplaceTempView("testData") > val x = sql( > """ > | SELECT b, sum(a) as a > | FROM testData > | GROUP BY b > | HAVING sum(a) > 3 > """.stripMargin) > x.explain() > x.show() > } > [info] - *** FAILED *** (508 milliseconds) > [info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 > missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, > sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. > Attribute(s) with the same name appear in the operation: a. Please check if > the right attribute(s) are used.;; > [info] Project [b#181, a#184] > [info] +- Filter (sum(a#184)#188 > cast(3 as double)) > [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, > sum(a#184) AS sum(a#184)#188] > [info] +- SubqueryAlias `testdata` > [info] +- Project [_1#177 AS a#180, _2#178 AS b#181] > [info] +- LocalRelation [_1#177, _2#178] > ``` > ``` > test("") { > Seq( > ("1", "3"), > ("2", "3"), > ("3", "6"), > ("4", "7"), > ("5", "9"), > ("6", "9") > ).toDF("a", "b").createOrReplaceTempView("testData") > val x = sql( > """ > | SELECT b, sum(a) as a > | FROM testData > | GROUP BY b > | HAVING sum(a) > 3 > """.stripMargin) > x.explain() > x.show() > } > == Physical Plan == > *(2) Project [b#181, a#184L] > +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 > as bigint))#197L > 3)) >+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))]) > +- Exchange hashpartitioning(b#181, 5) > +- *(1) HashAggregate(keys=[b#181], > functions=[partial_sum(cast(a#180 as bigint))]) > +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181] >+- LocalTableScan [_1#177, _2#178] > ```{code} > Spend A lot of time I can't find witch analyzer make this different, > When column type is double, it failed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31334) Use agg column in Having clause behave different with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074381#comment-17074381 ] angerszhu commented on SPARK-31334: --- I have found the reason. In the analyzer, when the logical plan
{code:java}
'Filter ('sum('a) > 3)
+- Aggregate [b#181], [b#181, sum(a#180) AS a#184L]
   +- SubqueryAlias `testdata`
      +- Project [_1#177 AS a#180, _2#178 AS b#181]
         +- LocalRelation [_1#177, _2#178]
{code}
reaches ResolveAggregateFunctions, the aggregate expression is still unresolved (because a is of String type), so ResolveAggregateFunctions makes no change to the plan above. The `sum(a)` in the Filter condition is then resolved later by ResolveReferences, and its a gets bound to the aggregate's output column a instead of the input column, which is where the error comes from. > Use agg column in Having clause behave different with column type > -- > > Key: SPARK-31334 > URL: https://issues.apache.org/jira/browse/SPARK-31334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > {code:java} > ``` > test("") { > Seq( > (1, 3), > (2, 3), > (3, 6), > (4, 7), > (5, 9), > (6, 9) > ).toDF("a", "b").createOrReplaceTempView("testData") > val x = sql( > """ > | SELECT b, sum(a) as a > | FROM testData > | GROUP BY b > | HAVING sum(a) > 3 > """.stripMargin) > x.explain() > x.show() > } > [info] - *** FAILED *** (508 milliseconds) > [info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 > missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, > sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. > Attribute(s) with the same name appear in the operation: a. Please check if > the right attribute(s) are used.;; > [info] Project [b#181, a#184] > [info] +- Filter (sum(a#184)#188 > cast(3 as double)) > [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, > sum(a#184) AS sum(a#184)#188] > [info] +- SubqueryAlias `testdata` > [info] +- Project [_1#177 AS a#180, _2#178 AS b#181] > [info] +- LocalRelation [_1#177, _2#178] > ``` > ``` > test("") { > Seq( > ("1", "3"), > ("2", "3"), > ("3", "6"), > ("4", "7"), > ("5", "9"), > ("6", "9") > ).toDF("a", "b").createOrReplaceTempView("testData") > val x = sql( > """ > | SELECT b, sum(a) as a > | FROM testData > | GROUP BY b > | HAVING sum(a) > 3 > """.stripMargin) > x.explain() > x.show() > } > == Physical Plan == > *(2) Project [b#181, a#184L] > +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 > as bigint))#197L > 3)) >+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))]) > +- Exchange hashpartitioning(b#181, 5) > +- *(1) HashAggregate(keys=[b#181], > functions=[partial_sum(cast(a#180 as bigint))]) > +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181] >+- LocalTableScan [_1#177, _2#178] > ```{code} > Spend A lot of time I can't find witch analyzer make this different, > When column type is double, it failed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
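Building on the analysis in the comment above, one possible workaround (implied by the explanation rather than stated in the ticket, and offered here only as an untested sketch) is to give the aggregate an alias that does not shadow the input column, so that sum(a) in the HAVING clause resolves against the input attribute rather than the aggregate's output:
{code:java}
import org.apache.spark.sql.SparkSession

object HavingAliasWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("having-alias").getOrCreate()
    import spark.implicits._

    Seq((1, 3), (2, 3), (3, 6), (4, 7), (5, 9), (6, 9))
      .toDF("a", "b")
      .createOrReplaceTempView("testData")

    // Aliasing the aggregate as "total" instead of "a" avoids the name clash
    // between the aggregate's output column and the input column used in HAVING.
    spark.sql(
      """
        | SELECT b, sum(a) AS total
        | FROM testData
        | GROUP BY b
        | HAVING sum(a) > 3
      """.stripMargin).show()
  }
}
{code}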
[jira] [Commented] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception
[ https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074361#comment-17074361 ] philipse commented on SPARK-18681: -- [~michael] any news for this issue ? i meet the same issue on spark2.4.5 > Throw Filtering is supported only on partition keys of type string exception > > > Key: SPARK-18681 > URL: https://issues.apache.org/jira/browse/SPARK-18681 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 2.1.0 > > > Cloudera put > {{/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml}} > as the configuration file for the Hive Metastore Server, where > {{hive.metastore.try.direct.sql=false}}. But Spark reading the gateway > configuration file and get default value > {{hive.metastore.try.direct.sql=true}}. we should use {{getMetaConf}} or > {{getMSC.getConfigValue}} method to obtain the original configuration from > Hive Metastore Server. > {noformat} > spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT); > Time taken: 0.221 seconds > spark-sql> select * from test where part=1 limit 10; > 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from > test where part=1 limit 10] > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938) > at > org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150) > at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435) > at > org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133) > at >
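For readers hitting this on 2.4.5 as in the comment above: the exception text quoted in the description already names a configuration-level workaround, with an explicit performance caveat. A minimal sketch of applying it when building the session, using the test table from the description, might look like this; it is only the workaround the message mentions, not a fix for the underlying metastore configuration mismatch.
{code:java}
import org.apache.spark.sql.SparkSession

object PartitionFilterWorkaround {
  def main(args: Array[String]): Unit = {
    // Workaround named in the exception message above: disable Spark-managed
    // file source partition handling. As the message warns, this can degrade
    // performance.
    val spark = SparkSession.builder()
      .appName("partition-filter-workaround")
      .config("spark.sql.hive.manageFilesourcePartitions", "false")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SELECT * FROM test WHERE part = 1 LIMIT 10").show()
  }
}
{code}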
[jira] [Created] (SPARK-31335) Add try function support
Kent Yao created SPARK-31335: Summary: Add try function support Key: SPARK-31335 URL: https://issues.apache.org/jira/browse/SPARK-31335 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Kent Yao {code:java} Evaluate an expression and handle certain types of execution errors by returning NULL. In cases where it is preferable that queries produce NULL instead of failing when corrupt or invalid data is encountered, the TRY function may be useful, especially when ANSI mode is on and users need null-tolerant behavior for certain columns or outputs. AnalysisExceptions will not be handled by this; the errors typically handled by the TRY function are: * Division by zero * Invalid casting * Numeric value out of range * etc. {code} An illustrative sketch of the intended usage is shown below. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
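To make the proposal concrete, here is a hedged sketch of how the proposed function might be used. The try(...) syntax below is hypothetical, illustrating this ticket's proposal; it does not exist in Spark yet, so those calls are left commented out, while the ANSI-mode failure they would guard against is real today.
{code:java}
import org.apache.spark.sql.SparkSession

object TryFunctionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("try-sketch").getOrCreate()

    // With ANSI mode on, execution errors such as division by zero fail the query.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    // spark.sql("SELECT 1/0").show()   // throws at runtime under ANSI mode

    // Proposed behavior (hypothetical syntax, per this ticket): wrapping the
    // expression in try(...) would return NULL instead of failing.
    // spark.sql("SELECT try(1/0) AS safe_div, try(CAST('abc' AS INT)) AS safe_cast").show()
    // Expected under the proposal: both columns are NULL.
  }
}
{code}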
[jira] [Resolved] (SPARK-31249) Flaky Test: CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is applied
[ https://issues.apache.org/jira/browse/SPARK-31249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31249. -- Fix Version/s: 3.0.0 Assignee: wuyi Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28100 > Flaky Test: CoarseGrainedSchedulerBackendSuite.custom log url for Spark UI is > applied > - > > Key: SPARK-31249 > URL: https://issues.apache.org/jira/browse/SPARK-31249 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120302/testReport/ > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did > not equal 3 > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) > at > org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.$anonfun$new$11(CoarseGrainedSchedulerBackendSuite.scala:186) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286) > at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) > at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31334) Use agg column in Having clause behave different with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074296#comment-17074296 ] angerszhu commented on SPARK-31334: --- cc [~cloud_fan] [~yumwang] > Use agg column in Having clause behave different with column type > -- > > Key: SPARK-31334 > URL: https://issues.apache.org/jira/browse/SPARK-31334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > {code:java} > ``` > test("") { > Seq( > (1, 3), > (2, 3), > (3, 6), > (4, 7), > (5, 9), > (6, 9) > ).toDF("a", "b").createOrReplaceTempView("testData") > val x = sql( > """ > | SELECT b, sum(a) as a > | FROM testData > | GROUP BY b > | HAVING sum(a) > 3 > """.stripMargin) > x.explain() > x.show() > } > [info] - *** FAILED *** (508 milliseconds) > [info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 > missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, > sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. > Attribute(s) with the same name appear in the operation: a. Please check if > the right attribute(s) are used.;; > [info] Project [b#181, a#184] > [info] +- Filter (sum(a#184)#188 > cast(3 as double)) > [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, > sum(a#184) AS sum(a#184)#188] > [info] +- SubqueryAlias `testdata` > [info] +- Project [_1#177 AS a#180, _2#178 AS b#181] > [info] +- LocalRelation [_1#177, _2#178] > ``` > ``` > test("") { > Seq( > ("1", "3"), > ("2", "3"), > ("3", "6"), > ("4", "7"), > ("5", "9"), > ("6", "9") > ).toDF("a", "b").createOrReplaceTempView("testData") > val x = sql( > """ > | SELECT b, sum(a) as a > | FROM testData > | GROUP BY b > | HAVING sum(a) > 3 > """.stripMargin) > x.explain() > x.show() > } > == Physical Plan == > *(2) Project [b#181, a#184L] > +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 > as bigint))#197L > 3)) >+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))]) > +- Exchange hashpartitioning(b#181, 5) > +- *(1) HashAggregate(keys=[b#181], > functions=[partial_sum(cast(a#180 as bigint))]) > +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181] >+- LocalTableScan [_1#177, _2#178] > ```{code} > Spend A lot of time I can't find witch analyzer make this different, > When column type is double, it failed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31334) Use agg column in Having clause behave different with column type
[ https://issues.apache.org/jira/browse/SPARK-31334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31334: -- Description: {code:java} ``` test("") { Seq( (1, 3), (2, 3), (3, 6), (4, 7), (5, 9), (6, 9) ).toDF("a", "b").createOrReplaceTempView("testData") val x = sql( """ | SELECT b, sum(a) as a | FROM testData | GROUP BY b | HAVING sum(a) > 3 """.stripMargin) x.explain() x.show() } [info] - *** FAILED *** (508 milliseconds) [info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. Attribute(s) with the same name appear in the operation: a. Please check if the right attribute(s) are used.;; [info] Project [b#181, a#184] [info] +- Filter (sum(a#184)#188 > cast(3 as double)) [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188] [info] +- SubqueryAlias `testdata` [info] +- Project [_1#177 AS a#180, _2#178 AS b#181] [info] +- LocalRelation [_1#177, _2#178] ``` ``` test("") { Seq( ("1", "3"), ("2", "3"), ("3", "6"), ("4", "7"), ("5", "9"), ("6", "9") ).toDF("a", "b").createOrReplaceTempView("testData") val x = sql( """ | SELECT b, sum(a) as a | FROM testData | GROUP BY b | HAVING sum(a) > 3 """.stripMargin) x.explain() x.show() } == Physical Plan == *(2) Project [b#181, a#184L] +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 as bigint))#197L > 3)) +- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))]) +- Exchange hashpartitioning(b#181, 5) +- *(1) HashAggregate(keys=[b#181], functions=[partial_sum(cast(a#180 as bigint))]) +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181] +- LocalTableScan [_1#177, _2#178] ```{code} Spend A lot of time I can't find witch analyzer make this different, When column type is double, it failed. > Use agg column in Having clause behave different with column type > -- > > Key: SPARK-31334 > URL: https://issues.apache.org/jira/browse/SPARK-31334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > {code:java} > ``` > test("") { > Seq( > (1, 3), > (2, 3), > (3, 6), > (4, 7), > (5, 9), > (6, 9) > ).toDF("a", "b").createOrReplaceTempView("testData") > val x = sql( > """ > | SELECT b, sum(a) as a > | FROM testData > | GROUP BY b > | HAVING sum(a) > 3 > """.stripMargin) > x.explain() > x.show() > } > [info] - *** FAILED *** (508 milliseconds) > [info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) a#184 > missing from a#180,b#181 in operator !Aggregate [b#181], [b#181, > sum(cast(a#180 as double)) AS a#184, sum(a#184) AS sum(a#184)#188]. > Attribute(s) with the same name appear in the operation: a. 
Please check if > the right attribute(s) are used.;; > [info] Project [b#181, a#184] > [info] +- Filter (sum(a#184)#188 > cast(3 as double)) > [info]+- !Aggregate [b#181], [b#181, sum(cast(a#180 as double)) AS a#184, > sum(a#184) AS sum(a#184)#188] > [info] +- SubqueryAlias `testdata` > [info] +- Project [_1#177 AS a#180, _2#178 AS b#181] > [info] +- LocalRelation [_1#177, _2#178] > ``` > ``` > test("") { > Seq( > ("1", "3"), > ("2", "3"), > ("3", "6"), > ("4", "7"), > ("5", "9"), > ("6", "9") > ).toDF("a", "b").createOrReplaceTempView("testData") > val x = sql( > """ > | SELECT b, sum(a) as a > | FROM testData > | GROUP BY b > | HAVING sum(a) > 3 > """.stripMargin) > x.explain() > x.show() > } > == Physical Plan == > *(2) Project [b#181, a#184L] > +- *(2) Filter (isnotnull(sum(cast(a#180 as bigint))#197L) && (sum(cast(a#180 > as bigint))#197L > 3)) >+- *(2) HashAggregate(keys=[b#181], functions=[sum(cast(a#180 as bigint))]) > +- Exchange hashpartitioning(b#181, 5) > +- *(1) HashAggregate(keys=[b#181], > functions=[partial_sum(cast(a#180 as bigint))]) > +- *(1) Project [_1#177 AS a#180, _2#178 AS b#181] >+- LocalTableScan [_1#177, _2#178] > ```{code} > Spend A lot of time I can't find witch analyzer make this different, > When column type is double, it failed. -- This message was sent by
[jira] [Created] (SPARK-31334) Use agg column in Having clause behave different with column type
angerszhu created SPARK-31334: - Summary: Use agg column in Having clause behave different with column type Key: SPARK-31334 URL: https://issues.apache.org/jira/browse/SPARK-31334 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org