[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459494#comment-17459494 ] Nicholas Chammas commented on SPARK-25150: -- I re-ran my test (described in the issue description and summarized in my comment just above) on Spark 3.2.0, and this issue appears to be resolved! Whether with cross joins enabled or disabled, I now get the correct results. I don't know which change since Spark 2.4.3 (the last time I re-ran this test) was responsible for the fix. But to be clear, in case anyone wants to reproduce my test:
# Download all 6 files attached to this issue into a folder.
# From within that folder, run {{spark-submit zombie-analysis.py}} and inspect the output.
# Enable cross joins (commented out on line 9 of the script), rerun it, and reinspect the output.
# Compare the final bit of output from both runs against {{expected-output.txt}}.
> Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.3 >Reporter: Nicholas Chammas >Priority: Major > Labels: correctness > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. 
> I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459467#comment-17459467 ] Nicholas Chammas commented on SPARK-24853: -- [~hyukjin.kwon] - Are you still opposed to this proposed improvement? If not, I'd like to work on it. > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-26589. -- Resolution: Won't Fix Marking this as "Won't Fix", but I suppose if someone really wanted to, they could reopen this issue and propose adding a median function that is simply an alias for {{percentile(col, 0.5)}}. I don't know how the committers would feel about that. > proper `median` method for spark dataframe > -- > > Key: SPARK-26589 > URL: https://issues.apache.org/jira/browse/SPARK-26589 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jan Gorecki >Priority: Minor > > I found multiple tickets asking for median function to be implemented in > Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as > duplicate of it. The thing is that approximate quantile is a workaround for > lack of median function. Thus I am filling this Feature Request for proper, > exact, not approximation of, median function. I am aware about difficulties > that are caused by distributed environment when trying to compute median, > nevertheless I don't think those difficulties is reason good enough to drop > out `median` function from scope of Spark. I am not asking about efficient > median but exact median. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459455#comment-17459455 ] Nicholas Chammas commented on SPARK-26589: -- It looks like making a distributed, memory-efficient implementation of median is not possible using the design of Catalyst as it stands today. For more details, please see [this thread on the dev list|http://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3cCAOhmDzev8d4H20XT1hUP9us=cpjeysgcf+xev7lg7dka1gj...@mail.gmail.com%3e]. It's possible to get an exact median today by using {{percentile(col, 0.5)}}, which is [available via the SQL API|https://spark.apache.org/docs/3.2.0/sql-ref-functions-builtin.html#aggregate-functions]. It's not memory-efficient, so it may not work well on large datasets. The Python and Scala DataFrame APIs do not offer this exact percentile function, so I've filed SPARK-37647 to track exposing this function in those APIs. > proper `median` method for spark dataframe > -- > > Key: SPARK-26589 > URL: https://issues.apache.org/jira/browse/SPARK-26589 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jan Gorecki >Priority: Minor > > I found multiple tickets asking for median function to be implemented in > Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as > duplicate of it. The thing is that approximate quantile is a workaround for > lack of median function. Thus I am filling this Feature Request for proper, > exact, not approximation of, median function. I am aware about difficulties > that are caused by distributed environment when trying to compute median, > nevertheless I don't think those difficulties is reason good enough to drop > out `median` function from scope of Spark. I am not asking about efficient > median but exact median. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
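To illustrate what the {{percentile(col, 0.5)}} route returns: to my understanding, Spark SQL's exact {{percentile}} interpolates linearly between the two closest ranks when the requested percentile falls between data points. Here is a plain-Python sketch of that interpolation semantics (my reading of the behavior, not Spark code; verify against your Spark version before relying on it):

```python
import math


def exact_percentile(values, p):
    """Exact percentile with linear interpolation between closest ranks.

    This mirrors (to my understanding) what Spark SQL's exact
    percentile() function computes for a numeric column.
    """
    xs = sorted(values)
    pos = p * (len(xs) - 1)          # fractional rank, 0-indexed
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    # When pos is an integer, lo == hi and frac == 0, so this
    # returns the element itself.
    return xs[lo] + (xs[hi] - xs[lo]) * frac
```

So for an even-length column, the 0.5 percentile lands halfway between the two middle values, which matches the usual definition of the median.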
[jira] [Created] (SPARK-37647) Expose percentile function in Scala/Python APIs
Nicholas Chammas created SPARK-37647: Summary: Expose percentile function in Scala/Python APIs Key: SPARK-37647 URL: https://issues.apache.org/jira/browse/SPARK-37647 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.2.0 Reporter: Nicholas Chammas SQL offers a percentile function (exact, not approximate) that is not available directly in the Scala or Python DataFrame APIs. While it is possible to invoke SQL functions from Scala or Python via {{expr()}}, I think most users expect function parity across Scala, Python, and SQL. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456860#comment-17456860 ] Nicholas Chammas commented on SPARK-26589: -- That makes sense to me. I've been struggling with how to approach the implementation, so I've posted to the dev list asking for a little more help. > proper `median` method for spark dataframe > -- > > Key: SPARK-26589 > URL: https://issues.apache.org/jira/browse/SPARK-26589 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jan Gorecki >Priority: Minor > > I found multiple tickets asking for median function to be implemented in > Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as > duplicate of it. The thing is that approximate quantile is a workaround for > lack of median function. Thus I am filling this Feature Request for proper, > exact, not approximation of, median function. I am aware about difficulties > that are caused by distributed environment when trying to compute median, > nevertheless I don't think those difficulties is reason good enough to drop > out `median` function from scope of Spark. I am not asking about efficient > median but exact median. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452081#comment-17452081 ] Nicholas Chammas commented on SPARK-26589: -- [~srowen] - I'll ask for help on the dev list if appropriate, but I'm wondering if you can give me some high-level guidance here. I have an outline of an approach to calculate the median that does not require sorting or shuffling the data. It's based on the approach I linked to in my previous comment (by Michael Harris). It does, however, require multiple passes over the data for the algorithm to converge on the median. Here's a working sketch of the approach:
{code:python}
from pyspark.sql.functions import col


def spark_median(data):
    total_count = data.count()
    if total_count % 2 == 0:
        target_positions = [total_count // 2, total_count // 2 + 1]
    else:
        target_positions = [total_count // 2 + 1]
    target_values = [
        kth_position(data, k, data_count=total_count)
        for k in target_positions
    ]
    return sum(target_values) / len(target_values)


def kth_position(data, k, data_count=None):
    if data_count is None:
        total_count = data.count()
    else:
        total_count = data_count
    if k > total_count or k < 1:
        return None
    while True:
        # This value, along with the following two counts, are the only data
        # that need to be shared across nodes.
        some_value = data.first()["id"]
        # These two counts can be performed together via an aggregator.
        larger_count = data.where(col("id") > some_value).count()
        equal_count = data.where(col("id") == some_value).count()
        # The 1-indexed positions this value would occupy in the sorted data.
        value_positions = range(
            total_count - larger_count - equal_count + 1,
            total_count - larger_count + 1,
        )
        if k in value_positions:
            return some_value
        elif k >= value_positions.stop:
            # The target is among the strictly larger values.
            k -= value_positions.stop - 1
            data = data.where(col("id") > some_value)
            total_count = larger_count
        elif k < value_positions.start:
            # The target is among the strictly smaller values.
            data = data.where(col("id") < some_value)
            total_count -= larger_count + equal_count
{code}
Of course, this needs to be converted into a Catalyst Expression, but the basic idea is expressed there. I am looking at the definitions of [DeclarativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L381-L394] and [ImperativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L267-L285] and trying to find an existing expression to model after, but I don't think we have any existing aggregates that work like this median would, specifically ones where multiple passes over the data are required (in this case, to count elements matching different filters). Do you have any advice on how to approach converting this into a Catalyst expression? There is an [NthValue|https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L648-L675] window expression, but I don't think I can build on it for my median expression, since (a) median shouldn't be limited to window expressions, and (b) NthValue requires a complete sort of the data, which I want to avoid. 
> proper `median` method for spark dataframe > -- > > Key: SPARK-26589 > URL: https://issues.apache.org/jira/browse/SPARK-26589 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jan Gorecki >Priority: Minor > > I found multiple tickets asking for median function to be implemented in > Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as > duplicate of it. The thing is that approximate quantile is a workaround for > lack of median function. Thus I am filling this Feature Request for proper, > exact, not approximation of, median function. I am aware about difficulties > that are caused by distributed environment when trying to compute median, > nevertheless I don't think those difficulties is reason good enough to drop > out `median` function from scope of Spark. I am not asking about efficient > median but exact median. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
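For reference, the multi-pass partition-and-count recursion in the sketch above can be checked in plain Python over an ordinary list. The helper names below are made up for illustration; only the logic mirrors the Spark sketch:

```python
def kth_smallest(values, k):
    """Return the k-th smallest element (1-indexed) using the same
    partition-and-count recursion as the Spark sketch above."""
    total = len(values)
    if k < 1 or k > total:
        return None
    while True:
        pivot = values[0]  # stands in for data.first()
        larger = sum(1 for v in values if v > pivot)
        equal = sum(1 for v in values if v == pivot)
        # 1-indexed positions the pivot value occupies in sorted order.
        positions = range(total - larger - equal + 1, total - larger + 1)
        if k in positions:
            return pivot
        elif k >= positions.stop:
            # Target is among the strictly larger values.
            k -= positions.stop - 1
            values = [v for v in values if v > pivot]
            total = larger
        else:
            # Target is among the strictly smaller values.
            values = [v for v in values if v < pivot]
            total -= larger + equal


def exact_median(values):
    """Exact median built on kth_smallest, matching spark_median above."""
    n = len(values)
    if n % 2 == 0:
        ks = [n // 2, n // 2 + 1]
    else:
        ks = [n // 2 + 1]
    picks = [kth_smallest(values, k) for k in ks]
    return sum(picks) / len(picks)
```

Each loop iteration discards the pivot's partition, so the list shrinks every pass, which is the convergence argument for the distributed version as well.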
[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451936#comment-17451936 ] Nicholas Chammas commented on SPARK-26589: -- Just for reference, Stack Overflow provides evidence that a proper median function has been in high demand for some time: * [How can I calculate exact median with Apache Spark?|https://stackoverflow.com/q/28158729/877069] (14K views) * [How to find median and quantiles using Spark|https://stackoverflow.com/q/31432843/877069] (117K views) * [Median / quantiles within PySpark groupBy|https://stackoverflow.com/q/46845672/877069] (67K views) > proper `median` method for spark dataframe > -- > > Key: SPARK-26589 > URL: https://issues.apache.org/jira/browse/SPARK-26589 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jan Gorecki >Priority: Minor > > I found multiple tickets asking for median function to be implemented in > Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as > duplicate of it. The thing is that approximate quantile is a workaround for > lack of median function. Thus I am filling this Feature Request for proper, > exact, not approximation of, median function. I am aware about difficulties > that are caused by distributed environment when trying to compute median, > nevertheless I don't think those difficulties is reason good enough to drop > out `median` function from scope of Spark. I am not asking about efficient > median but exact median. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451283#comment-17451283 ] Nicholas Chammas edited comment on SPARK-26589 at 11/30/21, 6:17 PM: - I think there is a potential solution using the algorithm [described here by Michael Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers]. was (Author: nchammas): I'm going to try to implement this using the algorithm [described here by Michael Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers]. > proper `median` method for spark dataframe > -- > > Key: SPARK-26589 > URL: https://issues.apache.org/jira/browse/SPARK-26589 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jan Gorecki >Priority: Minor > > I found multiple tickets asking for median function to be implemented in > Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as > duplicate of it. The thing is that approximate quantile is a workaround for > lack of median function. Thus I am filling this Feature Request for proper, > exact, not approximation of, median function. I am aware about difficulties > that are caused by distributed environment when trying to compute median, > nevertheless I don't think those difficulties is reason good enough to drop > out `median` function from scope of Spark. I am not asking about efficient > median but exact median. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe
[ https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451283#comment-17451283 ] Nicholas Chammas commented on SPARK-26589: -- I'm going to try to implement this using the algorithm [described here by Michael Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers]. > proper `median` method for spark dataframe > -- > > Key: SPARK-26589 > URL: https://issues.apache.org/jira/browse/SPARK-26589 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jan Gorecki >Priority: Minor > > I found multiple tickets asking for median function to be implemented in > Spark. Most of those tickets links to "SPARK-6761 Approximate quantile" as > duplicate of it. The thing is that approximate quantile is a workaround for > lack of median function. Thus I am filling this Feature Request for proper, > exact, not approximation of, median function. I am aware about difficulties > that are caused by distributed environment when trying to compute median, > nevertheless I don't think those difficulties is reason good enough to drop > out `median` function from scope of Spark. I am not asking about efficient > median but exact median. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12185) Add Histogram support to Spark SQL/DataFrames
[ https://issues.apache.org/jira/browse/SPARK-12185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-12185: - Labels: (was: bulk-closed) > Add Histogram support to Spark SQL/DataFrames > - > > Key: SPARK-12185 > URL: https://issues.apache.org/jira/browse/SPARK-12185 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Holden Karau >Priority: Minor > > While we have the ability to compute histograms on RDDs of Doubles it would > be good to also directly support histograms in Spark SQL (see > https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions > ). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12185) Add Histogram support to Spark SQL/DataFrames
[ https://issues.apache.org/jira/browse/SPARK-12185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas reopened SPARK-12185: -- Reopening this because I think it's a valid improvement that mirrors the existing {{RDD.histogram}} method. > Add Histogram support to Spark SQL/DataFrames > - > > Key: SPARK-12185 > URL: https://issues.apache.org/jira/browse/SPARK-12185 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Holden Karau >Priority: Minor > Labels: bulk-closed > > While we have the ability to compute histograms on RDDs of Doubles it would > be good to also directly support histograms in Spark SQL (see > https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions > ). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
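For context on the {{RDD.histogram}} method this proposal mirrors: given a bucket count n, it splits the range of the data into n evenly spaced buckets, with the last bucket closed on the right. A plain-Python sketch of that bucketing (my understanding of the semantics, not Spark code):

```python
def histogram(values, num_buckets):
    """Evenly spaced histogram, similar in spirit to RDD.histogram(n).

    Assumes the values are not all identical (bucket width > 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    edges = [lo + i * width for i in range(num_buckets + 1)]
    counts = [0] * num_buckets
    for v in values:
        # Buckets are half-open [edge_i, edge_i+1) except the last,
        # which also includes the maximum value.
        i = min(int((v - lo) / width), num_buckets - 1)
        counts[i] += 1
    return edges, counts
```

A SQL/DataFrame version would presumably return something like these (edges, counts) pairs per group, analogous to Hive's histogram_numeric linked in the description.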
[jira] [Resolved] (SPARK-37393) Inline annotations for {ml, mllib}/common.py
[ https://issues.apache.org/jira/browse/SPARK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-37393. -- Resolution: Duplicate > Inline annotations for {ml, mllib}/common.py > > > Key: SPARK-37393 > URL: https://issues.apache.org/jira/browse/SPARK-37393 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > This will allow us to run type checks against those files themselves. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37393) Inline annotations for {ml, mllib}/common.py
[ https://issues.apache.org/jira/browse/SPARK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37393: - Description: This will allow us to run type checks against those files themselves. > Inline annotations for {ml, mllib}/common.py > > > Key: SPARK-37393 > URL: https://issues.apache.org/jira/browse/SPARK-37393 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > This will allow us to run type checks against those files themselves. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37393) Inline annotations for {ml, mllib}/common.py
Nicholas Chammas created SPARK-37393: Summary: Inline annotations for {ml, mllib}/common.py Key: SPARK-37393 URL: https://issues.apache.org/jira/browse/SPARK-37393 Project: Spark Issue Type: Improvement Components: ML, MLlib, PySpark Affects Versions: 3.2.0 Reporter: Nicholas Chammas -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37380) Miscellaneous Python lint infra cleanup
Nicholas Chammas created SPARK-37380: Summary: Miscellaneous Python lint infra cleanup Key: SPARK-37380 URL: https://issues.apache.org/jira/browse/SPARK-37380 Project: Spark Issue Type: Improvement Components: Project Infra, PySpark Affects Versions: 3.2.0 Reporter: Nicholas Chammas
* pylintrc is obsolete
* tox.ini should be reformatted for easy reading/updating
* requirements used on CI should be in requirements.txt
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37336) Migrate _java2py to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37336: - Summary: Migrate _java2py to SparkSession (was: Migrate common ML utils to SparkSession) > Migrate _java2py to SparkSession > > > Key: SPARK-37336 > URL: https://issues.apache.org/jira/browse/SPARK-37336 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > {{_java2py()}} uses a deprecated method to create a SparkSession. > > https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37336) Migrate common ML utils to SparkSession
Nicholas Chammas created SPARK-37336: Summary: Migrate common ML utils to SparkSession Key: SPARK-37336 URL: https://issues.apache.org/jira/browse/SPARK-37336 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.2.0 Reporter: Nicholas Chammas {{_java2py()}} uses a deprecated method to create a SparkSession. https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37335) Clarify output of FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37335: - Description: The association rules returned by FPGrowth include more columns than are documented, like {{lift}}: [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] We should offer a basic description of these columns. An _itemset_ should also be briefly defined. was: The association rules returned by FPGrowth include more columns than are documented: [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] We should offer a basic description of these columns. > Clarify output of FPGrowth > -- > > Key: SPARK-37335 > URL: https://issues.apache.org/jira/browse/SPARK-37335 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > The association rules returned by FPGrowth include more columns than are > documented, like {{lift}}: > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > We should offer a basic description of these columns. An _itemset_ should > also be briefly defined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37335) Clarify output of FPGrowth
Nicholas Chammas created SPARK-37335: Summary: Clarify output of FPGrowth Key: SPARK-37335 URL: https://issues.apache.org/jira/browse/SPARK-37335 Project: Spark Issue Type: Improvement Components: Documentation, ML Affects Versions: 3.2.0 Reporter: Nicholas Chammas The association rules returned by FPGrowth include more columns than are documented: [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] We should offer a basic description of these columns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394 ] Nicholas Chammas edited comment on SPARK-24853 at 11/2/21, 2:41 PM: [~hyukjin.kwon] - It's not just for consistency. As noted in the description, this is useful when you are trying to rename a column with an ambiguous name. For example, imagine two tables {{left}} and {{right}}, each with a column called {{count}}: {code:python} ( left_counts.alias('left') .join(right_counts.alias('right'), on='join_key') .withColumn( 'total_count', left_counts['count'] + right_counts['count'] ) .withColumnRenamed('left.count', 'left_count') # no-op; alias doesn't work .withColumnRenamed('count', 'left_count') # incorrect; it renames both count columns .withColumnRenamed(left_counts['count'], 'left_count') # what, ideally, users want to do here .show() ){code} If you don't mind, I'm going to reopen this issue. was (Author: nchammas): [~hyukjin.kwon] - It's not just for consistency. As noted in the description, this is useful when you are trying to rename a column with an ambiguous name. For example, imagine two tables {{left}} and {{right}}, each with a column called {{count}}: {code:java} ( left_counts.alias('left') .join(right_counts.alias('right'), on='join_key') .withColumn( 'total_count', left_counts['count'] + right_counts['count'] ) .withColumnRenamed('left.count', 'left_count') # no-op; alias doesn't work .withColumnRenamed('count', 'left_count') # incorrect; it renames both count columns .withColumnRenamed(left_counts['count'], 'left_count') # what, ideally, users want to do here .show() ){code} If you don't mind, I'm going to reopen this issue. 
> Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.2, 3.2.0 >Reporter: nirav patel >Priority: Minor > > Can we add overloaded version of withColumn or withColumnRenamed that accept > Column type instead of String? That way I can specify FQN in case when there > is duplicate column names. e.g. if I have 2 columns with same name as a > result of join and I want to rename one of the field I can do it with this > new API. > > This would be similar to Drop api which supports both String and Column type. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. > > > > I think there should also be this one: > > def > withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame > Returns a new Dataset with a column renamed. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437396#comment-17437396 ] Nicholas Chammas commented on SPARK-24853: -- The [contributing guide|https://spark.apache.org/contributing.html] isn't clear on how to populate "Affects Version" for improvements, so I've just tagged the latest release.
[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-24853: - Priority: Minor (was: Major)
[jira] [Reopened] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas reopened SPARK-24853: --
[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394 ] Nicholas Chammas commented on SPARK-24853: -- [~hyukjin.kwon] - It's not just for consistency. As noted in the description, this is useful when you are trying to rename a column with an ambiguous name. For example, imagine two tables {{left}} and {{right}}, each with a column called {{count}}: {code:java} ( left_counts.alias('left') .join(right_counts.alias('right'), on='join_key') .withColumn( 'total_count', left_counts['count'] + right_counts['count'] ) .withColumnRenamed('left.count', 'left_count') # no-op; alias doesn't work .withColumnRenamed('count', 'left_count') # incorrect; it renames both count columns .withColumnRenamed(left_counts['count'], 'left_count') # what, ideally, users want to do here .show() ){code} If you don't mind, I'm going to reopen this issue.
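The ambiguity discussed in this thread can be illustrated with a small, self-contained toy model. This is plain Python, not Spark: the column representation and both helper functions are invented purely to show why a string-based rename touches both {{count}} columns after a join, while a qualified column reference can target exactly one.

```python
# Toy model of a joined result: each column is a (qualifier, name) pair,
# mimicking a join of two DataFrames that each carry a 'count' column.
joined_columns = [("left", "join_key"), ("left", "count"), ("right", "count")]

def rename_by_string(columns, existing, new):
    # String-based rename: matches every column whose unqualified name equals
    # `existing`, so it renames BOTH 'count' columns.
    return [(q, new) if name == existing else (q, name) for q, name in columns]

def rename_by_reference(columns, existing, new):
    # Reference-based rename: `existing` is a (qualifier, name) pair, so only
    # the single intended column is renamed.
    return [(q, new) if (q, name) == existing else (q, name) for q, name in columns]

# Renames both 'count' columns -- the surprising behavior described above:
print(rename_by_string(joined_columns, "count", "left_count"))
# Renames only left.count -- the behavior a Column-typed API would enable:
print(rename_by_reference(joined_columns, ("left", "count"), "left_count"))
```

The toy model sidesteps Spark entirely, but the resolution logic it sketches is the crux of the proposal: a rename keyed on a bare string cannot distinguish two same-named columns, whereas a rename keyed on a qualified reference can.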
[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-24853: - Affects Version/s: 3.2.0
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322333#comment-17322333 ] Nicholas Chammas commented on SPARK-33000: -- Per the discussion [on the dev list|http://apache-spark-developers-list.1001551.n3.nabble.com/Shutdown-cleanup-of-disk-based-resources-that-Spark-creates-td30928.html] and [PR|https://github.com/apache/spark/pull/31742], it seems we just want to update the documentation to clarify that {{cleanCheckpoints}} does not impact shutdown behavior. i.e. Checkpoints are not meant to be cleaned up on shutdown (whether planned or unplanned), and the config is currently working as intended. > cleanCheckpoints config does not clean all checkpointed RDDs on shutdown > > > Key: SPARK-33000 > URL: https://issues.apache.org/jira/browse/SPARK-33000 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Nicholas Chammas >Priority: Minor > > Maybe it's just that the documentation needs to be updated, but I found this > surprising: > {code:python} > $ pyspark > ... > >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') > >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') > >>> a = spark.range(10) > >>> a.checkpoint() > DataFrame[id: bigint] > > >>> exit(){code} > The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected > Spark to clean it up on shutdown. > The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} > says: > {quote}Controls whether to clean checkpoint files if the reference is out of > scope. > {quote} > When Spark shuts down, everything goes out of scope, so I'd expect all > checkpointed RDDs to be cleaned up. > For the record, I see the same behavior in both the Scala and Python REPLs. 
> Evidence the current behavior is confusing: > * [https://stackoverflow.com/q/52630858/877069] > * [https://stackoverflow.com/q/60009856/877069] > * [https://stackoverflow.com/q/61454740/877069] >
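Since the thread concludes that {{cleanCheckpoints}} is working as intended and checkpoints are deliberately not removed at shutdown, applications that do want shutdown cleanup must do it themselves. The following is one possible application-side sketch, not Spark behavior: the function name and the example path are invented for illustration.

```python
import atexit
import shutil
from pathlib import Path

def remove_checkpoint_dir(checkpoint_dir: str) -> None:
    # Delete the checkpoint directory tree, if it still exists.
    path = Path(checkpoint_dir)
    if path.exists():
        shutil.rmtree(path)

# Register the cleanup to run at interpreter exit, e.g. right after calling
# spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/'):
atexit.register(remove_checkpoint_dir, "/tmp/spark/checkpoint/")
```

Note that an `atexit` hook only fires on a clean interpreter exit; after a crash or a kill signal the directory is still left behind, which is consistent with the conclusion above that checkpoint durability across unplanned shutdowns is the intended design.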
[jira] [Commented] (SPARK-33436) PySpark equivalent of SparkContext.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-33436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299283#comment-17299283 ] Nicholas Chammas commented on SPARK-33436: -- [~hyukjin.kwon] - Can you clarify please why this ticket is "Won't Fix"? Just so it's clear for others who come across this ticket. Is `._jsc` the intended way for PySpark users to set S3A configs? > PySpark equivalent of SparkContext.hadoopConfiguration > -- > > Key: SPARK-33436 > URL: https://issues.apache.org/jira/browse/SPARK-33436 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark should offer an API to {{hadoopConfiguration}} to [match > Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration]. > Setting Hadoop configs within a job is handy for any configurations that are > not appropriate as cluster defaults, or that will not be known until run > time. The various {{fs.s3a.*}} configs are a good example of this. > Currently, what people are doing is setting things like this [via > SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].
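For readers who land on this thread, the {{._jsc}} workaround referenced above looks roughly like the following. This is a sketch of the private-API pattern from the linked Stack Overflow answer, not a supported public API, and the placeholder credential values are obviously illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Workaround: reach through the private _jsc (Java SparkContext) handle to the
# underlying Hadoop Configuration, since PySpark exposes no public equivalent
# of Scala's SparkContext.hadoopConfiguration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<your-access-key>")
hadoop_conf.set("fs.s3a.secret.key", "<your-secret-key>")
```

Because {{_jsc}} is a private attribute, it can change or disappear between releases without notice, which is precisely why the ticket asks for a public equivalent.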
[jira] [Updated] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-33000: - Description: Maybe it's just that the documentation needs to be updated, but I found this surprising: {code:python} $ pyspark ... >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') >>> a = spark.range(10) >>> a.checkpoint() DataFrame[id: bigint] >>> exit(){code} The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected Spark to clean it up on shutdown. The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says: {quote}Controls whether to clean checkpoint files if the reference is out of scope. {quote} When Spark shuts down, everything goes out of scope, so I'd expect all checkpointed RDDs to be cleaned up. For the record, I see the same behavior in both the Scala and Python REPLs. Evidence the current behavior is confusing: * [https://stackoverflow.com/q/52630858/877069] * [https://stackoverflow.com/q/60009856/877069] * [https://stackoverflow.com/q/61454740/877069]
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295531#comment-17295531 ] Nicholas Chammas commented on SPARK-33000: -- [~caowang888] - If you're still interested in this issue, take a look at my PR and let me know what you think. Hopefully, I've understood the issue correctly and proposed an appropriate fix.
[jira] [Resolved] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-34194. -- Resolution: Won't Fix > Queries that only touch partition columns shouldn't scan through all files > -- > > Key: SPARK-34194 > URL: https://issues.apache.org/jira/browse/SPARK-34194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > When querying only the partition columns of a partitioned table, it seems > that Spark nonetheless scans through all files in the table, even though it > doesn't need to. > Here's an example: > {code:python} > >>> data = spark.read.option('mergeSchema', > >>> 'false').parquet('s3a://some/dataset') > [Stage 0:==> (407 + 12) / > 1158] > {code} > Note the 1158 tasks. This matches the number of partitions in the table, > which is partitioned on a single field named {{file_date}}: > {code:sh} > $ aws s3 ls s3://some/dataset | head -n 3 >PRE file_date=2017-05-01/ >PRE file_date=2017-05-02/ >PRE file_date=2017-05-03/ > $ aws s3 ls s3://some/dataset | wc -l > 1158 > {code} > The table itself has over 138K files, though: > {code:sh} > $ aws s3 ls --recursive --human --summarize s3://some/dataset > ... > Total Objects: 138708 >Total Size: 3.7 TiB > {code} > Now let's try to query just the {{file_date}} field and see what Spark does. 
> {code:python} > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).explain() > == Physical Plan == > TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], > output=[file_date#11]) > +- *(1) ColumnarToRow >+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: > Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct<> > >>> data.select('file_date').orderBy('file_date', > >>> ascending=False).limit(1).show() > [Stage 2:> (179 + 12) / > 41011] > {code} > Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the > job progresses? I'm not sure. > What I do know is that this operation takes a long time (~20 min) running > from my laptop, whereas listing the top-level {{file_date}} partitions via > the AWS CLI takes a second or two. > Spark appears to be going through all the files in the table, when it just > needs to list the partitions captured in the S3 "directory" structure. The > query is only touching {{file_date}}, after all. > The current workaround for this performance problem / optimizer wastefulness > is to [query the catalog > directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot > of extra work compared to the elegant query against {{file_date}} that users > actually intend. > Spark should somehow know when it is only querying partition fields and skip > iterating through all the individual files in a table. > Tested on Spark 3.0.1.
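The key observation in the report above is that the answer to a partition-only query is fully encoded in the Hive-style directory names, so no data file ever needs to be opened. The sketch below makes that concrete in plain Python; the function name is invented, and the prefix strings mirror the `aws s3 ls` output shown in the description.

```python
def max_partition_value(prefixes, partition_key):
    """Extract the maximum value of a partition column from Hive-style
    directory prefixes like 'file_date=2017-05-01/', without touching
    any data files."""
    values = []
    for prefix in prefixes:
        name = prefix.rstrip("/")
        key, _, value = name.partition("=")
        if key == partition_key:
            values.append(value)
    return max(values)

# Prefixes as returned by a bucket listing (cf. the `aws s3 ls` output above).
# ISO dates compare correctly as strings, so max() works directly:
prefixes = ["file_date=2017-05-01/", "file_date=2017-05-02/", "file_date=2017-05-03/"]
print(max_partition_value(prefixes, "file_date"))  # 2017-05-03
```

This is essentially what the linked catalog-based workaround does, and why it answers in seconds: it inspects ~1,158 directory names instead of scheduling tens of thousands of file-scan tasks.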
[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281269#comment-17281269 ] Nicholas Chammas commented on SPARK-34194: -- It's not clear to me whether SPARK-26709 is describing an inherent design issue that has no fix, or whether SPARK-26709 simply captures a bug in the past implementation of {{OptimizeMetadataOnlyQuery}} which could conceivably be fixed in the future. If it's something that could be fixed and reintroduced, this issue should stay open. If we know for design reasons that metadata-only queries cannot be made reliably correct, then this issue should be closed with a clear explanation to that effect.
[jira] [Comment Edited] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276869#comment-17276869 ] Nicholas Chammas edited comment on SPARK-34194 at 2/2/21, 5:56 AM: --- Interesting reference, [~attilapiros]. It looks like that config is internal to Spark and was [deprecated in Spark 3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929] due to the correctness issue mentioned in that warning and documented in SPARK-26709.
[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
[ https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276869#comment-17276869 ] Nicholas Chammas commented on SPARK-34194: -- Interesting reference, [~attilapiros]. It looks like that config was [deprecated in Spark 3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929] due to the correctness issue mentioned in that warning and documented in SPARK-26709.
[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269571#comment-17269571 ] Nicholas Chammas commented on SPARK-12890: -- I've created SPARK-34194 and fleshed out the description of the problem a bit. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam >Priority: Minor > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Created] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files
Nicholas Chammas created SPARK-34194: Summary: Queries that only touch partition columns shouldn't scan through all files Key: SPARK-34194 URL: https://issues.apache.org/jira/browse/SPARK-34194 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Nicholas Chammas When querying only the partition columns of a partitioned table, it seems that Spark nonetheless scans through all files in the table, even though it doesn't need to. Here's an example: {code:python} >>> data = spark.read.option('mergeSchema', >>> 'false').parquet('s3a://some/dataset') [Stage 0:==> (407 + 12) / 1158] {code} Note the 1158 tasks. This matches the number of partitions in the table, which is partitioned on a single field named {{file_date}}: {code:sh} $ aws s3 ls s3://some/dataset | head -n 3 PRE file_date=2017-05-01/ PRE file_date=2017-05-02/ PRE file_date=2017-05-03/ $ aws s3 ls s3://some/dataset | wc -l 1158 {code} The table itself has over 138K files, though: {code:sh} $ aws s3 ls --recursive --human --summarize s3://some/dataset ... Total Objects: 138708 Total Size: 3.7 TiB {code} Now let's try to query just the {{file_date}} field and see what Spark does. {code:python} >>> data.select('file_date').orderBy('file_date', >>> ascending=False).limit(1).explain() == Physical Plan == TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], output=[file_date#11]) +- *(1) ColumnarToRow +- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> >>> data.select('file_date').orderBy('file_date', >>> ascending=False).limit(1).show() [Stage 2:> (179 + 12) / 41011] {code} Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the job progresses? I'm not sure. 
What I do know is that this operation takes a long time (~20 min) running from my laptop, whereas listing the top-level {{file_date}} partitions via the AWS CLI takes a second or two. Spark appears to be going through all the files in the table, when it just needs to list the partitions captured in the S3 "directory" structure. The query is only touching {{file_date}}, after all. The current workaround for this performance problem / optimizer wastefulness is to [query the catalog directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot of extra work compared to the elegant query against {{file_date}} that users actually intend. Spark should somehow know when it is only querying partition fields and skip iterating through all the individual files in a table. Tested on Spark 3.0.1.
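The listing-based shortcut described above can be illustrated without Spark at all: Hive-style partition directories encode the partition value in their names, so the maximum {{file_date}} is recoverable just by parsing the prefixes that {{aws s3 ls}} returns. A minimal sketch (the helper function and sample prefixes are illustrative, not part of the original report):

```python
def max_partition_value(prefixes, column="file_date"):
    """Return the largest partition value encoded in Hive-style
    directory names like 'file_date=2017-05-01/'."""
    values = []
    for prefix in prefixes:
        name = prefix.strip().rstrip("/")
        key, _, value = name.partition("=")
        if key == column and value:
            values.append(value)
    return max(values) if values else None

# Sample prefixes mirroring the `aws s3 ls` output shown above:
listing = ["file_date=2017-05-01/", "file_date=2017-05-02/", "file_date=2017-05-03/"]
print(max_partition_value(listing))  # -> 2017-05-03
```

This works for ISO-8601 dates because they sort lexicographically in chronological order; listing ~1,158 prefixes this way is why the AWS CLI answers in seconds while the full file scan takes minutes.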
[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267657#comment-17267657 ] Nicholas Chammas commented on SPARK-12890: -- Sure, will do. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam >Priority: Minor > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253044#comment-17253044 ] Nicholas Chammas edited comment on SPARK-12890 at 1/14/21, 5:41 PM: I think this is still an open issue. On Spark 2.4.6: {code:java} >>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', >>> ascending=False).limit(1).explain() == Physical Plan == TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], output=[file_date#144]) +- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> >>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', >>> ascending=False).limit(1).show() [Stage 23:=>(2110 + 12) / 21049] {code} {{file_date}} is a partitioning column: {code:java} $ aws s3 ls s3://some/dataset/ PRE file_date=2018-10-02/ PRE file_date=2018-10-08/ PRE file_date=2018-10-15/ ...{code} Schema merging is not enabled, as far as I can tell. Shouldn't Spark be able to answer this query without going through ~20K files? Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when it's only projecting partitioning columns? For the record, the best current workaround appears to be to [use the catalog to list partitions and extract what's needed that way|https://stackoverflow.com/a/65724151/877069]. But it seems like Spark should handle this situation better. was (Author: nchammas): I think this is still an open issue. 
On Spark 2.4.6: {code:java} >>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', >>> ascending=False).limit(1).explain() == Physical Plan == TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], output=[file_date#144]) +- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> >>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', >>> ascending=False).limit(1).show() [Stage 23:=>(2110 + 12) / 21049] {code} {{file_date}} is a partitioning column: {code:java} $ aws s3 ls s3://some/dataset/ PRE file_date=2018-10-02/ PRE file_date=2018-10-08/ PRE file_date=2018-10-15/ ...{code} Schema merging is not enabled, as far as I can tell. Shouldn't Spark be able to answer this query without going through ~20K files? Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when it's only projecting partitioning columns? For the record, the best current workaround appears to be to [use the catalog to list partitions and extract what's needed that way|https://stackoverflow.com/a/57440760/877069]. But it seems like Spark should handle this situation better. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam >Priority: Minor > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Reopened] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas reopened SPARK-12890: -- Reopening because I think there is a valid potential improvement to be made here. If I've misunderstood, feel free to close again. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam >Priority: Minor > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Updated] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-12890: - Labels: (was: bulk-closed) > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam >Priority: Major > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Updated] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-12890: - Priority: Minor (was: Major) > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam >Priority: Minor > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253044#comment-17253044 ] Nicholas Chammas commented on SPARK-12890: -- I think this is still an open issue. On Spark 2.4.6: {code:java} >>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', >>> ascending=False).limit(1).explain() == Physical Plan == TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], output=[file_date#144]) +- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> >>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', >>> ascending=False).limit(1).show() [Stage 23:=>(2110 + 12) / 21049] {code} {{file_date}} is a partitioning column: {code:java} $ aws s3 ls s3://some/dataset/ PRE file_date=2018-10-02/ PRE file_date=2018-10-08/ PRE file_date=2018-10-15/ ...{code} Schema merging is not enabled, as far as I can tell. Shouldn't Spark be able to answer this query without going through ~20K files? Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when it's only projecting partitioning columns? For the record, the best current workaround appears to be to [use the catalog to list partitions and extract what's needed that way|https://stackoverflow.com/a/57440760/877069]. But it seems like Spark should handle this situation better. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam >Priority: Major > Labels: bulk-closed > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. 
> Example: select max(date) from table, where the table is partitioned by date.
[jira] [Updated] (SPARK-33436) PySpark equivalent of SparkContext.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-33436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-33436: - Description: PySpark should offer an API to {{hadoopConfiguration}} to [match Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration]. Setting Hadoop configs within a job is handy for any configurations that are not appropriate as cluster defaults, or that will not be known until run time. The various {{fs.s3a.*}} configs are a good example of this. Currently, what people are doing is setting things like this [via SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069]. was: PySpark should offer an API to {{hadoopConfiguration}} to [match Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration]. Setting Hadoop configs within a job is handy for any configurations that are not appropriate as global defaults, or that will not be known until run time. The various {{fs.s3a.*}} configs are a good example of this. Currently, what people are doing is setting things like this [via SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069]. > PySpark equivalent of SparkContext.hadoopConfiguration > -- > > Key: SPARK-33436 > URL: https://issues.apache.org/jira/browse/SPARK-33436 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark should offer an API to {{hadoopConfiguration}} to [match > Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration]. > Setting Hadoop configs within a job is handy for any configurations that are > not appropriate as cluster defaults, or that will not be known until run > time. 
The various {{fs.s3a.*}} configs are a good example of this. > Currently, what people are doing is setting things like this [via > SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].
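The workaround linked above reaches through the private {{_jsc}} bridge on the Python SparkContext. A sketch of what that looks like in practice; the helper function and the placeholder credential values are illustrative, not an official API:

```python
def apply_hadoop_conf(hadoop_conf, settings):
    """Copy key/value pairs onto a Hadoop Configuration-like object
    (anything exposing a .set(key, value) method)."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)

# In a live PySpark session, the commonly cited workaround is
# (credential values are placeholders):
#   apply_hadoop_conf(
#       spark.sparkContext._jsc.hadoopConfiguration(),
#       {"fs.s3a.access.key": "<key>", "fs.s3a.secret.key": "<secret>"},
#   )
```

Because {{_jsc}} is an internal attribute rather than public API, code written this way can break across Spark versions, which is precisely the motivation for exposing a public {{hadoopConfiguration}} equivalent in PySpark.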
[jira] [Updated] (SPARK-33436) PySpark equivalent of SparkContext.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-33436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-33436: - Description: PySpark should offer an API to {{hadoopConfiguration}} to [match Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration]. Setting Hadoop configs within a job is handy for any configurations that are not appropriate as global defaults, or that will not be known until run time. The various {{fs.s3a.*}} configs are a good example of this. Currently, what people are doing is setting things like this [via SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069]. was: PySpark should offer an API to {{hadoopConfiguration}} to [match Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration]. Setting Hadoop configs within a job is handy for any configurations that are not appropriate as global defaults, or that will not be known until run time. The various {{fs.s3a.*}} configs are a good example of this. > PySpark equivalent of SparkContext.hadoopConfiguration > -- > > Key: SPARK-33436 > URL: https://issues.apache.org/jira/browse/SPARK-33436 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark should offer an API to {{hadoopConfiguration}} to [match > Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration]. > Setting Hadoop configs within a job is handy for any configurations that are > not appropriate as global defaults, or that will not be known until run time. > The various {{fs.s3a.*}} configs are a good example of this. 
> Currently, what people are doing is setting things like this [via > SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].
[jira] [Created] (SPARK-33436) PySpark equivalent of SparkContext.hadoopConfiguration
Nicholas Chammas created SPARK-33436: Summary: PySpark equivalent of SparkContext.hadoopConfiguration Key: SPARK-33436 URL: https://issues.apache.org/jira/browse/SPARK-33436 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.0 Reporter: Nicholas Chammas PySpark should offer an API to {{hadoopConfiguration}} to [match Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration]. Setting Hadoop configs within a job is handy for any configurations that are not appropriate as global defaults, or that will not be known until run time. The various {{fs.s3a.*}} configs are a good example of this.
[jira] [Updated] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-33434: - Affects Version/s: (was: 2.4.7) (was: 3.0.1) 3.1.0 > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf
[jira] [Updated] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-33434: - Affects Version/s: 3.0.1 > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 2.4.7, 3.0.1 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf
[jira] [Updated] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-33434: - Description: PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears to be a public method introduced in SPARK-24761. http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf was:PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears to be a public method introduced in SPARK-24761. > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 2.4.7 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf
[jira] [Created] (SPARK-33434) Document spark.conf.isModifiable()
Nicholas Chammas created SPARK-33434: Summary: Document spark.conf.isModifiable() Key: SPARK-33434 URL: https://issues.apache.org/jira/browse/SPARK-33434 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 2.4.7 Reporter: Nicholas Chammas PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears to be a public method introduced in SPARK-24761.
[jira] [Commented] (SPARK-26764) [SPIP] Spark Relational Cache
[ https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218374#comment-17218374 ] Nicholas Chammas commented on SPARK-26764: -- The SPIP PDF references a design doc, but I'm not clear on where the design doc actually is. Is this issue supposed to be linked to some other ones? Also, appendix B suggests to me that this idea would mesh well with the existing proposals to support materialized views. I could actually see this as an enhancement to those proposals, like SPARK-29038. In fact, when I look at the [design doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit#] for SPARK-29038, I see that goal 3 covers automatic query rewrites, which I think subsumes the main benefit of this proposal as compared to "traditional" materialized views. {quote}> 3. A query _rewrite_ capability to transparently rewrite a query to use a materialized view[1][2]. > a. Query rewrite capability is transparent to SQL applications. > b. Query rewrite can be disabled at the system level or on individual > materialized view. Also it can be disabled for a specified query via hint. > c. Query rewrite as a rule in optimizer should be made sure that it won’t > cause performance regression if it can use other index or cache. {quote} > [SPIP] Spark Relational Cache > - > > Key: SPARK-26764 > URL: https://issues.apache.org/jira/browse/SPARK-26764 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Adrian Wang >Priority: Major > Attachments: Relational+Cache+SPIP.pdf > > > In modern database systems, relational cache is a common technology to boost > ad-hoc queries. While Spark provides cache natively, Spark SQL should be able > to utilize the relationship between relations to boost all possible queries. 
> In this SPIP, we will make Spark be able to utilize all defined cached > relations if possible, without explicit substitution in user query, as well > as keep some user defined cache available in different sessions. Materialized > views in many database systems provide similar function.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215577#comment-17215577 ] Nicholas Chammas commented on SPARK-33000: -- Ctrl-D gracefully shuts down the Python REPL, so that should trigger the appropriate cleanup. I repeated my test and did {{spark.stop()}} instead of Ctrl-D and waited 2 minutes. Same result. No cleanup. > cleanCheckpoints config does not clean all checkpointed RDDs on shutdown > > > Key: SPARK-33000 > URL: https://issues.apache.org/jira/browse/SPARK-33000 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Nicholas Chammas >Priority: Minor > > Maybe it's just that the documentation needs to be updated, but I found this > surprising: > {code:python} > $ pyspark > ... > >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') > >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') > >>> a = spark.range(10) > >>> a.checkpoint() > DataFrame[id: bigint] > > >>> exit(){code} > The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected > Spark to clean it up on shutdown. > The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} > says: > {quote}Controls whether to clean checkpoint files if the reference is out of > scope. > {quote} > When Spark shuts down, everything goes out of scope, so I'd expect all > checkpointed RDDs to be cleaned up. > For the record, I see the same behavior in both the Scala and Python REPLs.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215469#comment-17215469 ] Nicholas Chammas commented on SPARK-33000: -- I've tested this out a bit more, and I think the original issue I reported is valid. Either that, or I'm still missing something. I built Spark at the latest commit from {{master}}: {code:java} commit 3ae1520185e2d96d1bdbd08c989f0d48ad3ba578 (HEAD -> master, origin/master, origin/HEAD, apache/master) Author: ulysses Date: Fri Oct 16 11:26:27 2020 + {code} One thing that has changed is that Spark now prevents you from setting {{cleanCheckpoints}} after startup: {code:java} >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') Traceback (most recent call last): File "", line 1, in File ".../spark/python/pyspark/sql/conf.py", line 36, in set self._jconf.set(key, value) File ".../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__ File ".../spark/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: spark.cleaner.referenceTracking.cleanCheckpoints; {code} So that's good! This makes it clear to the user that this setting cannot be set at this time (though it could be made more helpful it explained why). 
However, if I try to set the config as part of invoking PySpark, I still don't see any checkpointed data get cleaned up on shutdown: {code:java} $ rm -rf /tmp/spark/checkpoint/ $ ./bin/pyspark --conf spark.cleaner.referenceTracking.cleanCheckpoints=true >>> spark.conf.get('spark.cleaner.referenceTracking.cleanCheckpoints') 'true' >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') >>> a = spark.range(10) >>> a.checkpoint() DataFrame[id: bigint] >>> $ du -sh /tmp/spark/checkpoint/* 32K /tmp/spark/checkpoint/57b0a413-9d47-4bcd-99ef-265e9f5c0f3b{code} > cleanCheckpoints config does not clean all checkpointed RDDs on shutdown > > > Key: SPARK-33000 > URL: https://issues.apache.org/jira/browse/SPARK-33000 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Nicholas Chammas >Priority: Minor > > Maybe it's just that the documentation needs to be updated, but I found this > surprising: > {code:python} > $ pyspark > ... > >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') > >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') > >>> a = spark.range(10) > >>> a.checkpoint() > DataFrame[id: bigint] > > >>> exit(){code} > The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected > Spark to clean it up on shutdown. > The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} > says: > {quote}Controls whether to clean checkpoint files if the reference is out of > scope. > {quote} > When Spark shuts down, everything goes out of scope, so I'd expect all > checkpointed RDDs to be cleaned up. > For the record, I see the same behavior in both the Scala and Python REPLs.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215085#comment-17215085 ] Nicholas Chammas commented on SPARK-33000: -- Thanks for the explanation! I'm happy to leave this to you if you'd like to get back into open source work. If you're not sure you'll get to it in the next couple of weeks, let me know and I can take care of this since it's just a documentation update. > cleanCheckpoints config does not clean all checkpointed RDDs on shutdown > > > Key: SPARK-33000 > URL: https://issues.apache.org/jira/browse/SPARK-33000 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Nicholas Chammas >Priority: Minor > > Maybe it's just that the documentation needs to be updated, but I found this > surprising: > {code:python} > $ pyspark > ... > >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') > >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') > >>> a = spark.range(10) > >>> a.checkpoint() > DataFrame[id: bigint] > > >>> exit(){code} > The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected > Spark to clean it up on shutdown. > The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} > says: > {quote}Controls whether to clean checkpoint files if the reference is out of > scope. > {quote} > When Spark shuts down, everything goes out of scope, so I'd expect all > checkpointed RDDs to be cleaned up. > For the record, I see the same behavior in both the Scala and Python REPLs.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214904#comment-17214904 ] Nicholas Chammas commented on SPARK-33000: -- Thanks for the pointer! No need for a new ticket. I can adjust the title of this ticket and open a PR for the doc update. Separate question: Do you know why there is such a restriction on when this configuration needs to be set? It's surprising since you can set the checkpoint directory after job submission.
[jira] [Created] (SPARK-33017) PySpark Context should have getCheckpointDir() method
Nicholas Chammas created SPARK-33017: Summary: PySpark Context should have getCheckpointDir() method Key: SPARK-33017 URL: https://issues.apache.org/jira/browse/SPARK-33017 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.0 Reporter: Nicholas Chammas To match the Scala API, PySpark should offer a direct way to get the checkpoint dir. {code:scala} scala> spark.sparkContext.setCheckpointDir("/tmp/spark/checkpoint") scala> spark.sparkContext.getCheckpointDir res3: Option[String] = Some(file:/tmp/spark/checkpoint/34ebe699-bc83-4c5d-bfa2-50451296cf87) {code} Currently, the only way to do that from PySpark is via the underlying Java context: {code:python} >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') >>> sc._jsc.sc().getCheckpointDir().get() 'file:/tmp/spark/checkpoint/ebf0fab5-edbc-42c2-938f-65d5e599cf54' {code}
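The method the ticket asks for could be sketched roughly as below. This is a hedged sketch, not the final API: the function name and its placement are assumptions; it just unwraps the Scala `Option[String]` that the underlying Java context returns, using the same py4j calls as the manual workaround in the ticket.

```python
# Hedged sketch of what SPARK-33017 proposes (not actual PySpark code):
# unwrap the Scala Option[String] returned by the Java SparkContext.
# isDefined()/get() are the Option accessors exposed through py4j.
def get_checkpoint_dir(sc):
    """Return the checkpoint directory as a string, or None if unset."""
    opt = sc._jsc.sc().getCheckpointDir()
    return opt.get() if opt.isDefined() else None
```

Returning `None` when no checkpoint dir is set mirrors how PySpark usually surfaces Scala `Option` values to Python callers.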
[jira] [Updated] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-33000: - Description: Maybe it's just that the documentation needs to be updated, but I found this surprising: {code:python} $ pyspark ... >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') >>> a = spark.range(10) >>> a.checkpoint() DataFrame[id: bigint] >>> exit(){code} The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected Spark to clean it up on shutdown. The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says: {quote}Controls whether to clean checkpoint files if the reference is out of scope. {quote} When Spark shuts down, everything goes out of scope, so I'd expect all checkpointed RDDs to be cleaned up. For the record, I see the same behavior in both the Scala and Python REPLs. was: Maybe it's just that the documentation needs to be updated, but I found this surprising: {code:java} $ pyspark ... >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') >>> a = spark.range(10) >>> a.checkpoint() DataFrame[id: bigint] >>> exit(){code} The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected Spark to clean it up on shutdown. The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says: > Controls whether to clean checkpoint files if the reference is out of scope. When Spark shuts down, everything goes out of scope, so I'd expect all checkpointed RDDs to be cleaned up. 
[jira] [Created] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
Nicholas Chammas created SPARK-33000: Summary: cleanCheckpoints config does not clean all checkpointed RDDs on shutdown Key: SPARK-33000 URL: https://issues.apache.org/jira/browse/SPARK-33000 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.6 Reporter: Nicholas Chammas Maybe it's just that the documentation needs to be updated, but I found this surprising: {code:java} $ pyspark ... >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') >>> a = spark.range(10) >>> a.checkpoint() DataFrame[id: bigint] >>> exit(){code} The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected Spark to clean it up on shutdown. The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says: > Controls whether to clean checkpoint files if the reference is out of scope. When Spark shuts down, everything goes out of scope, so I'd expect all checkpointed RDDs to be cleaned up.
[jira] [Commented] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py
[ https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187364#comment-17187364 ] Nicholas Chammas commented on SPARK-32084: -- Can you share a couple of examples of what you are referring to in this ticket? > Replace dictionary-based function definitions to proper functions in > functions.py > - > > Key: SPARK-32084 > URL: https://issues.apache.org/jira/browse/SPARK-32084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently some functions in {{functions.py}} are defined by a dictionary. It > programmatically defines the functions to the module; however, this prevents some > IDEs, such as PyCharm, from detecting them. > It also makes it hard to add proper examples to the docstrings.
[jira] [Updated] (SPARK-31167) Refactor how we track Python test/build dependencies
[ https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31167: - Description: Ideally, we should have a single place to track Python development dependencies and reuse it in all the relevant places: developer docs, Dockerfile, and GitHub CI. Where appropriate, we should pin dependencies to ensure a reproducible Python environment. (was: Ideally, we should have a single place to track Python development dependencies and reuse it in all the relevant places: developer docs, Dockerfile, ) > Refactor how we track Python test/build dependencies > > > Key: SPARK-31167 > URL: https://issues.apache.org/jira/browse/SPARK-31167 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > Ideally, we should have a single place to track Python development > dependencies and reuse it in all the relevant places: developer docs, > Dockerfile, and GitHub CI. Where appropriate, we should pin dependencies to > ensure a reproducible Python environment.
[jira] [Updated] (SPARK-31167) Refactor how we track Python test/build dependencies
[ https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31167: - Description: Ideally, we should have a single place to track Python development dependencies and reuse it in all the relevant places: developer docs, Dockerfile,
[jira] [Created] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries
Nicholas Chammas created SPARK-32686: Summary: Un-deprecate inferring DataFrame schema from list of dictionaries Key: SPARK-32686 URL: https://issues.apache.org/jira/browse/SPARK-32686 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.1.0 Reporter: Nicholas Chammas Inferring the schema of a DataFrame from a list of dictionaries feels natural for PySpark users, and also mirrors [basic functionality in Pandas|https://stackoverflow.com/a/20638258/877069]. This is currently possible in PySpark but comes with a deprecation warning. We should un-deprecate this behavior if there are no deeper reasons to discourage users from this API, beyond wanting to push them to use {{Row}}.
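While dict inference stays deprecated, users must rewrite their dicts as keyword-constructed {{Row}} objects by hand. A hedged sketch of that conversion (the helper name is mine; with PySpark, `row_factory` would be `pyspark.sql.Row`, while a plain `dict` factory works for illustration):

```python
# Hypothetical helper (not a PySpark API): convert a list of dicts into
# keyword-constructed records, the shape createDataFrame accepts without
# a deprecation warning. Field names are sorted, matching how Spark
# orders fields when it infers a schema from dicts.
def dicts_to_records(records, row_factory):
    keys = sorted(records[0])
    return [row_factory(**{k: r.get(k) for k in keys}) for r in records]
```

In real usage that would be `spark.createDataFrame(dicts_to_records(data, Row))` instead of `spark.createDataFrame(data)`; the ticket argues this extra step should not be necessary.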
[jira] [Commented] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries
[ https://issues.apache.org/jira/browse/SPARK-32686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182040#comment-17182040 ] Nicholas Chammas commented on SPARK-32686: -- Not sure if I have the "Affects Version" field set appropriately.
[jira] [Commented] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
[ https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089917#comment-17089917 ] Nicholas Chammas commented on SPARK-27623: -- Is this perhaps just a documentation issue? i.e. The documentation [here|http://spark.apache.org/docs/2.4.5/sql-data-sources-avro.html#deploying] should use 2.11 instead of 2.12, or at least clarify how to identify what to use. The [downloads page|http://spark.apache.org/downloads.html] does say: {quote}Note that, Spark is pre-built with Scala 2.11 except version 2.4.2, which is pre-built with Scala 2.12. {quote} Which I think explains the behavior people are reporting here. > Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > --- > > Key: SPARK-27623 > URL: https://issues.apache.org/jira/browse/SPARK-27623 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.2 >Reporter: Alexandru Barbulescu >Priority: Major > > After updating to spark 2.4.2 when using the > {code:java} > spark.read.format().options().load() > {code} > > chain of methods, regardless of what parameter is passed to "format" we get > the following error related to avro: > > {code:java} > - .options(**load_options) > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line > 172, in load > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in > deco > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load. 
> - : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > - at java.util.ServiceLoader.fail(ServiceLoader.java:232) > - at java.util.ServiceLoader.access$100(ServiceLoader.java:185) > - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) > - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > - at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > - at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) > - at scala.collection.Iterator.foreach(Iterator.scala:941) > - at scala.collection.Iterator.foreach$(Iterator.scala:941) > - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > - at scala.collection.IterableLike.foreach(IterableLike.scala:74) > - at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > - at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > - at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250) > - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248) > - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > - at scala.collection.TraversableLike.filter(TraversableLike.scala:262) > - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262) > - at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > - at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > - at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > - at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > - at 
java.lang.reflect.Method.invoke(Method.java:498) > - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > - at py4j.Gateway.invoke(Gateway.java:282) > - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > - at py4j.commands.CallCommand.execute(CallCommand.java:79) > - at py4j.GatewayConnection.run(GatewayConnection.java:238) > - at java.lang.Thread.run(Thread.java:748) > - Caused by: java.lang.NoClassDefFoundError: > org/apache/spark/sql/execution/datasources/FileFormat$class > - at org.apache.spark.sql.avro.AvroFileFormat.(AvroFileFormat.scala:44) > - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > - at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > - at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > - at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > - at java.lang.Class.newInstance(Class.java:442) > - at
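The root cause the comment points to is a Scala binary-version mismatch: loading {{spark-avro}} built for one Scala version against a Spark distribution built with another. A hedged invocation sketch (version numbers are illustrative; the rule is that the artifact's `_2.11`/`_2.12` suffix must match the Scala version the Spark distribution was built with):

```shell
# Spark 2.4.x releases are prebuilt with Scala 2.11 (except 2.4.2),
# so the avro artifact must carry the matching _2.11 suffix:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.5 app.py

# Spark 2.4.2 is the exception, prebuilt with Scala 2.12,
# so it needs the _2.12 artifact instead:
spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.2 app.py
```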
[jira] [Commented] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
[ https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088997#comment-17088997 ] Nicholas Chammas commented on SPARK-31170: -- Isn't this also an issue in Spark 2.4.5? {code:java} $ spark-sql ... 20/04/21 15:12:26 INFO metastore: Mestastore configuration hive.metastore.warehouse.dir changed from /user/hive/warehouse to file:/Users/myusername/spark-warehouse ... spark-sql> create table test(a int); ... 20/04/21 15:12:48 WARN HiveMetaStore: Location: file:/user/hive/warehouse/test specified for non-external table:test 20/04/21 15:12:48 INFO FileUtils: Creating directory if it doesn't exist: file:/user/hive/warehouse/test Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/test is not a directory or unable to create one);{code} If I understood the problem correctly, then the fix should perhaps also be backported to 2.4.x. > Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir > > > Key: SPARK-31170 > URL: https://issues.apache.org/jira/browse/SPARK-31170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > In Spark CLI, we create a hive CliSessionState and it does not load the > hive-site.xml. So the configurations in hive-site.xml will not take effect > like in other spark-hive integration apps. > Also, the warehouse directory is not correctly picked. If the `default` > database does not exist, the CliSessionState will create one during the first > time it talks to the metastore. The `Location` of the default DB will be > neither the value of spark.sql.warehouse.dir nor the user-specified value of > hive.metastore.warehouse.dir, but the default value of > hive.metastore.warehouse.dir which will always be `/user/hive/warehouse`.
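Until the fix landed, one plausible mitigation for the CLI behavior described above was to pass the warehouse location explicitly rather than relying on hive-site.xml being picked up. This is a hedged sketch of such an invocation, not a confirmed workaround from the ticket:

```shell
# Pass the warehouse location explicitly so spark-sql does not fall back
# to the hardcoded /user/hive/warehouse default when hive-site.xml is
# ignored. The path here is illustrative.
spark-sql --conf spark.sql.warehouse.dir=/Users/myusername/spark-warehouse
```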
[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch
[ https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074217#comment-17074217 ] Nicholas Chammas commented on SPARK-31330: -- Hmm, I didn't see anything from you on the mailing list. But thanks for these references! This is very helpful. Looks like you had Infra enable autolabeler for the Avro project over in INFRA-17367. I will ask Infra to do the same for Spark and cc [~hyukjin.kwon] for committer approval (which I guess Infra may ask for). > Automatically label PRs based on the paths they touch > - > > Key: SPARK-31330 > URL: https://issues.apache.org/jira/browse/SPARK-31330 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > We can potentially leverage the added labels to drive testing, review, or > other project tooling.
[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch
[ https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074124#comment-17074124 ] Nicholas Chammas commented on SPARK-31330: -- Unfortunately, it seems I jumped the gun on sending that dev email about the GitHub PR labeler action. It has a fundamental limitation that currently makes it [useless for us|https://github.com/actions/labeler/tree/d2c408e7ed8498dfdf675c5f8d133ab37b6f8520#pull-request-labeler]: {quote}Note that only pull requests being opened from the same repository can be labeled. This action will not currently work for pull requests from forks – like is common in open source projects – because the token for forked pull request workflows does not have write permissions. {quote} Additional detail: [https://github.com/actions/labeler/issues/12#issuecomment-525762657] I'll keep my eye on that Action in case they somehow lift or work around the limitation on forked repositories. Of course, we can always implement this functionality ourselves, but the attraction of the GitHub Action was that we could reuse an existing, tested, and widely adopted implementation.
[jira] [Created] (SPARK-31330) Automatically label PRs based on the paths they touch
Nicholas Chammas created SPARK-31330: Summary: Automatically label PRs based on the paths they touch Key: SPARK-31330 URL: https://issues.apache.org/jira/browse/SPARK-31330 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.1.0 Reporter: Nicholas Chammas We can potentially leverage the added labels to drive testing, review, or other project tooling.
[jira] [Updated] (SPARK-31167) Refactor how we track Python test/build dependencies
[ https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31167: - Summary: Refactor how we track Python test/build dependencies (was: Refactor how we track Python test dependencies)
[jira] [Updated] (SPARK-31167) Refactor how we track Python test dependencies
[ https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31167: - Summary: Refactor how we track Python test dependencies (was: Specify missing test dependencies)
[jira] [Created] (SPARK-31167) Specify missing test dependencies
Nicholas Chammas created SPARK-31167: Summary: Specify missing test dependencies Key: SPARK-31167 URL: https://issues.apache.org/jira/browse/SPARK-31167 Project: Spark Issue Type: Improvement Components: Build, Tests Affects Versions: 3.1.0 Reporter: Nicholas Chammas
[jira] [Updated] (SPARK-31155) Remove pydocstyle tests
[ https://issues.apache.org/jira/browse/SPARK-31155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31155: - Summary: Remove pydocstyle tests (was: Enable pydocstyle tests) > Remove pydocstyle tests > --- > > Key: SPARK-31155 > URL: https://issues.apache.org/jira/browse/SPARK-31155 > Project: Spark > Issue Type: Bug > Components: Build, Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > pydocstyle tests have been running neither on Jenkins nor on Github.
[jira] [Updated] (SPARK-31155) Remove pydocstyle tests
[ https://issues.apache.org/jira/browse/SPARK-31155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31155: - Description: pydocstyle tests have been running neither on Jenkins nor on Github. We also seem to be in a [bad place|https://github.com/apache/spark/pull/27912#issuecomment-599167117] to re-enable them. was:pydocstyle tests have been running neither on Jenkins nor on Github.
[jira] [Updated] (SPARK-29280) DataFrameReader should support a compression option
[ https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-29280: - Affects Version/s: (was: 3.0.0) 3.1.0 > DataFrameReader should support a compression option > --- > > Key: SPARK-29280 > URL: https://issues.apache.org/jira/browse/SPARK-29280 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > [DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter] > supports a {{compression}} option, but > [DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader] > doesn't. The lack of a {{compression}} option in the reader causes some > friction in the following cases: > # You want to read some data compressed with a codec that Spark does not > [load by > default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization]. > # You want to read some data with a codec that overrides one of the built-in > codecs that Spark supports. > # You want to explicitly instruct Spark on what codec to use on read when it > will not be able to correctly auto-detect it (e.g. because the file extension > is [missing,|https://stackoverflow.com/q/52011697/877069] > [non-standard|https://stackoverflow.com/q/44372995/877069], or > [incorrect|https://stackoverflow.com/q/49110384/877069]). > Case #2 came up in SPARK-29102. There is a very handy library called > [SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you > load a single gzipped file using multiple concurrent tasks. (You can see the > details of how it works and why it's useful in the project README and in > SPARK-29102.) > To use this codec, I had to set {{io.compression.codecs}}. 
I guess this is a > Hadoop filesystem API setting, since it [doesn't appear to be documented by > Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, > there is also a setting called {{spark.io.compression.codec}}, which seems to > be for a different purpose. > It would be much clearer for the user and more consistent with the writer > interface if the reader let you directly specify the codec. > For example, I think all of the following should be possible: > {code:python} > spark.read.option('compression', 'lz4').csv(...) > spark.read.csv(..., > compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec') > spark.read.json(..., compression='none') > {code}
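The cases above all come down to the reader relying on extension-based codec auto-detection with no explicit escape hatch. A hypothetical sketch of that resolution logic, not Spark code: the codec class names are real Hadoop codecs, but the function and mapping are illustrative of what a `compression` option would let users override.

```python
# Hypothetical sketch (not Spark internals): extension-based codec lookup,
# plus the explicit override that a `compression` read option would give
# users. The class names are real Hadoop compression codecs.
EXTENSION_CODECS = {
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}

def resolve_codec(path, compression=None):
    """An explicit codec wins; otherwise sniff the file extension, which
    fails when the extension is missing, non-standard, or incorrect."""
    if compression and compression != "none":
        return compression
    for ext, codec in EXTENSION_CODECS.items():
        if path.endswith(ext):
            return codec
    return None
```

With an explicit option, a file named `data.dat` could still be read as splittable gzip by passing the codec class directly, exactly the case the ticket describes.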
[jira] [Created] (SPARK-31155) Enable pydocstyle tests
Nicholas Chammas created SPARK-31155: Summary: Enable pydocstyle tests Key: SPARK-31155 URL: https://issues.apache.org/jira/browse/SPARK-31155 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 3.0.0 Reporter: Nicholas Chammas pydocstyle tests have been running neither on Jenkins nor on Github.
[jira] [Created] (SPARK-31153) Cleanup several failures in lint-python
Nicholas Chammas created SPARK-31153: Summary: Cleanup several failures in lint-python Key: SPARK-31153 URL: https://issues.apache.org/jira/browse/SPARK-31153 Project: Spark Issue Type: Bug Components: Build, PySpark Affects Versions: 3.0.0 Reporter: Nicholas Chammas Don't understand how this script runs fine on the build server. Perhaps we've just been getting lucky? Will detail the issues on the PR.
[jira] [Resolved] (SPARK-31075) Add documentation for ALTER TABLE ... ADD PARTITION
[ https://issues.apache.org/jira/browse/SPARK-31075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-31075. -- Resolution: Duplicate > Add documentation for ALTER TABLE ... ADD PARTITION > --- > > Key: SPARK-31075 > URL: https://issues.apache.org/jira/browse/SPARK-31075 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > Our docs for {{ALTER TABLE}} > [currently|https://github.com/apache/spark/blob/cba17e07e9f15673f274de1728f6137d600026e1/docs/sql-ref-syntax-ddl-alter-table.md] > make no mention of {{ADD PARTITION}}.
[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054676#comment-17054676 ] Nicholas Chammas commented on SPARK-31043: -- It's working for me now (per my comment), but when I was seeing the issue simply starting a PySpark shell was enough to throw that error. Do you still need a reproduction? > Spark 3.0 built against hadoop2.7 can't start standalone master > --- > > Key: SPARK-31043 > URL: https://issues.apache.org/jira/browse/SPARK-31043 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Critical > Fix For: 3.0.0 > > > trying to start a standalone master when building spark branch 3.0 with > hadoop2.7 fails with: > > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at > java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at > java.net.URLClassLoader$1.run(URLClassLoader.java:369) > ... > Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal > at > java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > ... 42 more
[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
[ https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054674#comment-17054674 ] Nicholas Chammas commented on SPARK-31065: -- Thanks for looking into it. I have a silly question: Why isn't {{schema_of_json()}} simply syntactic sugar for {{spark.read.json()}}? For example: {code:python} def _schema_of_json(string): df = spark.read.json(spark.sparkContext.parallelize([string])) return df.schema{code} Perhaps there are practical reasons not to do this, but conceptually speaking this kind of equivalence should hold. Yet this bug report demonstrates that they are not equivalent. {code:python} from pyspark.sql.functions import from_json, schema_of_json json = '{"a": ""}' df = spark.createDataFrame([(json,)], schema=['json']) df.show() # chokes with org.apache.spark.sql.catalyst.parser.ParseException json_schema = schema_of_json(json) df.select(from_json('json', json_schema)) # works fine json_schema = _schema_of_json(json) df.select(from_json('json', json_schema)) {code} > Empty string values cause schema_of_json() to return a schema not usable by > from_json() > --- > > Key: SPARK-31065 > URL: https://issues.apache.org/jira/browse/SPARK-31065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a reproduction: > > {code:python} > from pyspark.sql.functions import from_json, schema_of_json > json = '{"a": ""}' > df = spark.createDataFrame([(json,)], schema=['json']) > df.show() > # chokes with org.apache.spark.sql.catalyst.parser.ParseException > json_schema = schema_of_json(json) > df.select(from_json('json', json_schema)) > # works fine > json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema > df.select(from_json('json', json_schema)) > {code} > The output: > {code:java} > >>> from pyspark.sql.functions import from_json, schema_of_json > >>> json = '{"a": ""}' > >>> > >>> df = 
spark.createDataFrame([(json,)], schema=['json']) > >>> df.show() > +-+ > | json| > +-+ > |{"a": ""}| > +-+ > >>> > >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException > >>> json_schema = schema_of_json(json) > >>> df.select(from_json('json', json_schema)) > Traceback (most recent call last): > File ".../site-packages/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File > ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.functions.from_json. > : org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', > 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', > 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', > 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', > 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', > 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', > 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', > 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', > 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', > 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', > 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', > 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', > 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', > 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', > 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', > 'SERDEPROPERTIES', 
'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', > 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', > 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', > 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', > 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', > 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', > 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', > 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', > 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', > 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR',
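The equivalence argued for in the comment above can be illustrated outside Spark: when schema inference actually parses a sample document (as {{spark.read.json}} does), an empty string value is just a string, so the inferred schema stays consumable. This is a toy sketch, not Spark code; {{infer_schema}} is a hypothetical stand-in for single-document inference.

```python
import json

def infer_schema(sample: str) -> dict:
    """Toy inference: map each top-level field of a JSON object to a type name."""
    type_names = {
        bool: "boolean",
        int: "bigint",
        float: "double",
        str: "string",
        type(None): "null",
    }
    doc = json.loads(sample)
    # Exact-type lookup, so True maps to boolean rather than bigint.
    return {key: type_names[type(value)] for key, value in doc.items()}

# An empty string value is still a string, so nothing special happens.
print(infer_schema('{"a": ""}'))  # {'a': 'string'}
```

Under this view, {{schema_of_json('{"a": ""}')}} producing something {{from_json()}} cannot parse is exactly the divergence between the two code paths that the bug report demonstrates.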
[jira] [Created] (SPARK-31075) Add documentation for ALTER TABLE ... ADD PARTITION
Nicholas Chammas created SPARK-31075: Summary: Add documentation for ALTER TABLE ... ADD PARTITION Key: SPARK-31075 URL: https://issues.apache.org/jira/browse/SPARK-31075 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.1.0 Reporter: Nicholas Chammas Our docs for {{ALTER TABLE}} [currently|https://github.com/apache/spark/blob/cba17e07e9f15673f274de1728f6137d600026e1/docs/sql-ref-syntax-ddl-alter-table.md] make no mention of {{ADD PARTITION}}.
[jira] [Updated] (SPARK-31041) Show Maven errors from within make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Description: This works: {code:java} ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud {code} But this doesn't: {code:java} ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip{code} The latter invocation yields the following, confusing output: {code:java} + VERSION=' -X,--debug Produce execution debug output'{code} That's because Maven is accepting {{--pip}} as an option and failing, but the user doesn't get to see the error from Maven. was: This works: {code:java} ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud {code} But this doesn't: {code:java} ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip{code} The latter invocation yields the following, confusing output: {code:java} + VERSION=' -X,--debug Produce execution debug output'{code} > Show Maven errors from within make-distribution.sh > -- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > {code:java} > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud {code} > > But this doesn't: > {code:java} > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip{code} > > The latter invocation yields the following, confusing output: > {code:java} > + VERSION=' -X,--debug Produce execution debug output'{code} > That's because Maven is accepting {{--pip}} as an option and failing, but > the user doesn't get to see the error from Maven. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
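The failure mode described above — Maven's error/help text silently captured into {{VERSION}} — comes from capturing a command's output without checking its exit status. A short Python stand-in (not the actual make-distribution.sh code; {{get_version}} is hypothetical) shows the fix the issue title asks for: surface the underlying tool's error instead of trusting whatever it printed.

```python
import subprocess
import sys

def get_version(cmd: list) -> str:
    """Stand-in for: VERSION=$("$MVN" help:evaluate -Dexpression=project.version "$@")."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Don't store the tool's error/help text in VERSION; raise instead.
        raise RuntimeError(f"version command failed: {result.stderr.strip()}")
    return result.stdout.strip()

# Succeeds: the captured output really is the version string.
print(get_version([sys.executable, "-c", "print('3.1.0-SNAPSHOT')"]))
```

With this pattern, passing a flag the underlying tool rejects (as {{--pip}} after the Maven profiles does) fails loudly at the version step rather than producing the confusing `VERSION=' -X,--debug ...'` assignment.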
[jira] [Updated] (SPARK-31041) Show Maven errors from within make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Summary: Show Maven errors from within make-distribution.sh (was: Make arguments to make-distribution.sh position-independent) > Show Maven errors from within make-distribution.sh > -- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > {code:java} > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud {code} > > But this doesn't: > {code:java} > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip{code} > > The latter invocation yields the following, confusing output: > {code:java} > + VERSION=' -X,--debug Produce execution debug output'{code} >
[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master
[ https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053080#comment-17053080 ] Nicholas Chammas commented on SPARK-31043: -- FWIW I was seeing the same {{java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal}} issue on {{branch-3.0}} and pulling the latest changes fixed it for me too. > Spark 3.0 built against hadoop2.7 can't start standalone master > --- > > Key: SPARK-31043 > URL: https://issues.apache.org/jira/browse/SPARK-31043 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Critical > Fix For: 3.0.0 > > > trying to start a standalone master when building spark branch 3.0 with > hadoop2.7 fails with: > > Exception in thread "main" java.lang.NoClassDefFoundError: > org/w3c/dom/ElementTraversal > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:757) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at > java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at > java.net.URLClassLoader$1.run(URLClassLoader.java:369) > ... > Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal > at > java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:419) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) > at java.lang.ClassLoader.loadClass(ClassLoader.java:352) > ... 42 more
[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
[ https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053079#comment-17053079 ] Nicholas Chammas commented on SPARK-31065: -- Confirmed this issue is also present on {{branch-3.0}} as of commit {{9b48f3358d3efb523715a5f258e5ed83e28692f6}}. > Empty string values cause schema_of_json() to return a schema not usable by > from_json() > --- > > Key: SPARK-31065 > URL: https://issues.apache.org/jira/browse/SPARK-31065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a reproduction: > > {code:python} > from pyspark.sql.functions import from_json, schema_of_json > json = '{"a": ""}' > df = spark.createDataFrame([(json,)], schema=['json']) > df.show() > # chokes with org.apache.spark.sql.catalyst.parser.ParseException > json_schema = schema_of_json(json) > df.select(from_json('json', json_schema)) > # works fine > json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema > df.select(from_json('json', json_schema)) > {code} > The output: > {code:java} > >>> from pyspark.sql.functions import from_json, schema_of_json > >>> json = '{"a": ""}' > >>> > >>> df = spark.createDataFrame([(json,)], schema=['json']) > >>> df.show() > +-+ > | json| > +-+ > |{"a": ""}| > +-+ > >>> > >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException > >>> json_schema = schema_of_json(json) > >>> df.select(from_json('json', json_schema)) > Traceback (most recent call last): > File ".../site-packages/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File > ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.functions.from_json. 
> : org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', > 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', > 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', > 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', > 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', > 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', > 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', > 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', > 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', > 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', > 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', > 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', > 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', > 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', > 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', > 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', > 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', > 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', > 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', > 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', > 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', > 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', > 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', > 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
'PARTITIONED', > 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', > 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', > 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', > 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) > == SQL == > struct > --^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) > at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) > at > org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777)
[jira] [Updated] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
[ https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31065: - Affects Version/s: 3.0.0 > Empty string values cause schema_of_json() to return a schema not usable by > from_json() > --- > > Key: SPARK-31065 > URL: https://issues.apache.org/jira/browse/SPARK-31065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a reproduction: > > {code:python} > from pyspark.sql.functions import from_json, schema_of_json > json = '{"a": ""}' > df = spark.createDataFrame([(json,)], schema=['json']) > df.show() > # chokes with org.apache.spark.sql.catalyst.parser.ParseException > json_schema = schema_of_json(json) > df.select(from_json('json', json_schema)) > # works fine > json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema > df.select(from_json('json', json_schema)) > {code} > The output: > {code:java} > >>> from pyspark.sql.functions import from_json, schema_of_json > >>> json = '{"a": ""}' > >>> > >>> df = spark.createDataFrame([(json,)], schema=['json']) > >>> df.show() > +-+ > | json| > +-+ > |{"a": ""}| > +-+ > >>> > >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException > >>> json_schema = schema_of_json(json) > >>> df.select(from_json('json', json_schema)) > Traceback (most recent call last): > File ".../site-packages/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File > ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.functions.from_json. 
> : org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', > 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', > 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', > 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', > 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', > 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', > 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', > 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', > 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', > 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', > 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', > 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', > 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', > 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', > 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', > 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', > 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', > 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', > 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', > 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', > 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', > 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', > 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', > 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
'PARTITIONED', > 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', > 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', > 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', > 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) > == SQL == > struct > --^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) > at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) > at > org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.(jsonExpressions.scala:527) > at
[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
[ https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053052#comment-17053052 ] Nicholas Chammas commented on SPARK-31065: -- cc [~hyukjin.kwon] > Empty string values cause schema_of_json() to return a schema not usable by > from_json() > --- > > Key: SPARK-31065 > URL: https://issues.apache.org/jira/browse/SPARK-31065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a reproduction: > > {code:python} > from pyspark.sql.functions import from_json, schema_of_json > json = '{"a": ""}' > df = spark.createDataFrame([(json,)], schema=['json']) > df.show() > # chokes with org.apache.spark.sql.catalyst.parser.ParseException > json_schema = schema_of_json(json) > df.select(from_json('json', json_schema)) > # works fine > json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema > df.select(from_json('json', json_schema)) > {code} > The output: > {code:java} > >>> from pyspark.sql.functions import from_json, schema_of_json > >>> json = '{"a": ""}' > >>> > >>> df = spark.createDataFrame([(json,)], schema=['json']) > >>> df.show() > +-+ > | json| > +-+ > |{"a": ""}| > +-+ > >>> > >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException > >>> json_schema = schema_of_json(json) > >>> df.select(from_json('json', json_schema)) > Traceback (most recent call last): > File ".../site-packages/pyspark/sql/utils.py", line 63, in deco > return f(*a, **kw) > File > ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.sql.functions.from_json. 
> : org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', > 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', > 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', > 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', > 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', > 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', > 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', > 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', > 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', > 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', > 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', > 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', > 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', > 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', > 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', > 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', > 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', > 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', > 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', > 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', > 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', > 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', > 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', > 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', > 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', > 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 
'PARTITIONED', > 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', > 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', > 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', > 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) > == SQL == > struct > --^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) > at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) > at > org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777) > at > org.apache.spark.sql.catalyst.expressions.JsonToStructs.(jsonExpressions.scala:527) >
[jira] [Created] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()
Nicholas Chammas created SPARK-31065: Summary: Empty string values cause schema_of_json() to return a schema not usable by from_json() Key: SPARK-31065 URL: https://issues.apache.org/jira/browse/SPARK-31065 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5 Reporter: Nicholas Chammas Here's a reproduction: {code:python} from pyspark.sql.functions import from_json, schema_of_json json = '{"a": ""}' df = spark.createDataFrame([(json,)], schema=['json']) df.show() # chokes with org.apache.spark.sql.catalyst.parser.ParseException json_schema = schema_of_json(json) df.select(from_json('json', json_schema)) # works fine json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema df.select(from_json('json', json_schema)) {code} The output: {code:java} >>> from pyspark.sql.functions import from_json, schema_of_json >>> json = '{"a": ""}' >>> >>> df = spark.createDataFrame([(json,)], schema=['json']) >>> df.show() +-+ | json| +-+ |{"a": ""}| +-+ >>> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException >>> json_schema = schema_of_json(json) >>> df.select(from_json('json', json_schema)) Traceback (most recent call last): File ".../site-packages/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.from_json. 
: org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 
'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6) == SQL == struct --^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64) at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123) at org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777) at org.apache.spark.sql.catalyst.expressions.JsonToStructs.(jsonExpressions.scala:527) at org.apache.spark.sql.functions$.from_json(functions.scala:3606) at org.apache.spark.sql.functions.from_json(functions.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Description: This works: {code:java} ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud {code} But this doesn't: {code:java} ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip{code} The latter invocation yields the following, confusing output: {code:java} + VERSION=' -X,--debug Produce execution debug output'{code} was: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip ``` > Make arguments to make-distribution.sh position-independent > --- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > {code:java} > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud {code} > > But this doesn't: > {code:java} > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip{code} > > The latter invocation yields the following, confusing output: > {code:java} > + VERSION=' -X,--debug Produce execution debug output'{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Description: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip ``` was: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip ``` > Make arguments to make-distribution.sh position-independent > --- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > ``` > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud > ``` > > But this doesn't: > ``` > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip > ```
[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Description: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip ``` was: This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip``` > Make arguments to make-distribution.sh position-independent > --- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > > ``` > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud > ``` > > But this doesn't: > > ``` > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip > ```
[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-31041: - Summary: Make arguments to make-distribution.sh position-independent (was: Make argument to make-distribution position-independent) > Make arguments to make-distribution.sh position-independent > --- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Trivial > > This works: > > ``` > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud > ``` > > But this doesn't: > > ``` > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip```
[jira] [Created] (SPARK-31041) Make argument to make-distribution position-independent
Nicholas Chammas created SPARK-31041: Summary: Make argument to make-distribution position-independent Key: SPARK-31041 URL: https://issues.apache.org/jira/browse/SPARK-31041 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.1.0 Reporter: Nicholas Chammas This works: ``` ./dev/make-distribution.sh \ --pip \ -Phadoop-2.7 -Phive -Phadoop-cloud ``` But this doesn't: ``` ./dev/make-distribution.sh \ -Phadoop-2.7 -Phive -Phadoop-cloud \ --pip```
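As an editorial aside, the fix the issue asks for is the standard one: instead of consuming `--pip` from a fixed position, walk all of the arguments with a `case`/`shift` loop so flags are recognized wherever they appear. The sketch below is illustrative only (variable names are hypothetical, and it is not the actual make-distribution.sh patch):

```shell
# Illustrative sketch, not the actual make-distribution.sh change:
# scan every argument, so --pip is honored regardless of its position,
# and collect everything else to pass through to Maven.
parse_args() {
  MAKE_PIP=false
  MVN_FLAGS=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --pip)
        MAKE_PIP=true                  # recognized wherever it appears
        ;;
      *)
        MVN_FLAGS="$MVN_FLAGS $1"      # pass-through flags for Maven
        ;;
    esac
    shift
  done
  MVN_FLAGS="${MVN_FLAGS# }"           # trim the leading space
}

# Both orderings from the issue now yield the same configuration:
parse_args --pip -Phadoop-2.7 -Phive -Phadoop-cloud
echo "pip=$MAKE_PIP flags=$MVN_FLAGS"  # pip=true flags=-Phadoop-2.7 -Phive -Phadoop-cloud
parse_args -Phadoop-2.7 -Phive -Phadoop-cloud --pip
echo "pip=$MAKE_PIP flags=$MVN_FLAGS"  # same output: position no longer matters
```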
[jira] [Created] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()
Nicholas Chammas created SPARK-31001: Summary: Add ability to create a partitioned table via catalog.createTable() Key: SPARK-31001 URL: https://issues.apache.org/jira/browse/SPARK-31001 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Nicholas Chammas There doesn't appear to be a way to create a partitioned table using the Catalog interface.
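Until the Catalog interface grows such an option, one hedged workaround sketch is to bypass catalog.createTable() and issue a CREATE TABLE ... PARTITIONED BY statement via spark.sql(). The helper and table/column names below are purely illustrative, and the live spark.sql() call is only referenced, not executed:

```python
# Workaround sketch (assumes a live SparkSession `spark` elsewhere):
# catalog.createTable() exposes no partitioning argument, but Spark SQL's
# CREATE TABLE ... PARTITIONED BY does, so build the DDL string directly.
# The table name, columns, and partition column here are hypothetical.

def partitioned_table_ddl(table, columns, partition_cols, source="parquet"):
    """Build a CREATE TABLE statement that declares partition columns."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(partition_cols)
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"USING {source} "
        f"PARTITIONED BY ({parts})"
    )

ddl = partitioned_table_ddl(
    "sales",
    [("id", "BIGINT"), ("amount", "DOUBLE"), ("region", "STRING")],
    ["region"],
)
print(ddl)
# CREATE TABLE sales (id BIGINT, amount DOUBLE, region STRING) USING parquet PARTITIONED BY (region)
# In a live session one would then run: spark.sql(ddl)
```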
[jira] [Created] (SPARK-31000) Add ability to set table description in the catalog
Nicholas Chammas created SPARK-31000: Summary: Add ability to set table description in the catalog Key: SPARK-31000 URL: https://issues.apache.org/jira/browse/SPARK-31000 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Nicholas Chammas It seems that the catalog supports a {{description}} attribute on tables. https://github.com/apache/spark/blob/86cc907448f0102ad0c185e87fcc897d0a32707f/sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala#L68 However, the {{createTable()}} interface doesn't provide any way to set that attribute.
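A similar hedged workaround applies here: Spark SQL's CREATE TABLE accepts a table-level COMMENT clause, which populates the description the catalog surfaces, even though catalog.createTable() has no such parameter. The helper and names below are illustrative only; the resulting DDL would be run with spark.sql() in a live session:

```python
# Workaround sketch for setting a table description (hypothetical helper):
# generate CREATE TABLE ... COMMENT '...' DDL instead of calling
# catalog.createTable(), which has no description parameter.

def table_ddl_with_comment(table, columns, comment, source="parquet"):
    """Build a CREATE TABLE statement carrying a table-level COMMENT."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    escaped = comment.replace("'", "\\'")  # naive single-quote escaping
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"USING {source} "
        f"COMMENT '{escaped}'"
    )

print(table_ddl_with_comment("events", [("ts", "TIMESTAMP")], "Raw event log"))
# CREATE TABLE events (ts TIMESTAMP) USING parquet COMMENT 'Raw event log'
```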
[jira] [Resolved] (SPARK-30838) Add missing pages to documentation index
[ https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas resolved SPARK-30838. -- Resolution: Won't Fix > Add missing pages to documentation index > > > Key: SPARK-30838 > URL: https://issues.apache.org/jira/browse/SPARK-30838 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Trivial > > There are a few pages tracked in `docs/` that are not linked to from the top > navigation or from the index page. Seems unintentional. > To make these pages easier to discover, we should at least list them in the > index and link to them from other pages as appropriate.
[jira] [Commented] (SPARK-30838) Add missing pages to documentation index
[ https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039351#comment-17039351 ] Nicholas Chammas commented on SPARK-30838: -- Actually, it looks like the pages I wanted to add (hadoop-provided, cloud infra, web ui) already have references from one place or another. They're not as discoverable as I'd like, but I don't have a good idea as to how to improve things right now, so I'm going to close this. > Add missing pages to documentation index > > > Key: SPARK-30838 > URL: https://issues.apache.org/jira/browse/SPARK-30838 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Trivial > > There are a few pages tracked in `docs/` that are not linked to from the top > navigation or from the index page. Seems unintentional. > To make these pages easier to discover, we should at least list them in the > index and link to them from other pages as appropriate.
[jira] [Updated] (SPARK-30838) Add missing pages to documentation index
[ https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-30838: - Summary: Add missing pages to documentation index (was: Add missing pages to documentation top navigation menu) > Add missing pages to documentation index > > > Key: SPARK-30838 > URL: https://issues.apache.org/jira/browse/SPARK-30838 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Trivial > > There are a few pages tracked in `docs/` that are not linked to from the top > navigation or from the index page. Seems unintentional. > To make these pages easier to discover, we should at least list them in the > index and link to them from other pages as appropriate.
[jira] [Updated] (SPARK-30838) Add missing pages to documentation top navigation menu
[ https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-30838: - Description: There are a few pages tracked in `docs/` that are not linked to from the top navigation or from the index page. Seems unintentional. To make these pages easier to discover, we should at least list them in the index and link to them from other pages as appropriate. was: There are a few pages tracked in `docs/` that are not linked to from the top navigation. Seems unintentional. To make these pages easier to discover, we should list them up top and link to them from other documentation pages as appropriate. > Add missing pages to documentation top navigation menu > -- > > Key: SPARK-30838 > URL: https://issues.apache.org/jira/browse/SPARK-30838 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Trivial > > There are a few pages tracked in `docs/` that are not linked to from the top > navigation or from the index page. Seems unintentional. > To make these pages easier to discover, we should at least list them in the > index and link to them from other pages as appropriate.
[jira] [Updated] (SPARK-30838) Add missing pages to documentation top navigation menu
[ https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-30838: - Description: There are a few pages tracked in `docs/` that are not linked to from the top navigation. Seems unintentional. To make these pages easier to discover, we should list them up top and link to them from other documentation pages as appropriate. was: There are a few pages tracked in `docs/` that are not linked to from the top navigation. Seems unintentional. > Add missing pages to documentation top navigation menu > -- > > Key: SPARK-30838 > URL: https://issues.apache.org/jira/browse/SPARK-30838 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Trivial > > There are a few pages tracked in `docs/` that are not linked to from the top > navigation. Seems unintentional. > > To make these pages easier to discover, we should list them up top and link to > them from other documentation pages as appropriate.
[jira] [Updated] (SPARK-30838) Add missing pages to documentation top navigation menu
[ https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-30838: - Description: There are a few pages tracked in `docs/` that are not linked to from the top navigation. Seems unintentional. To make these pages easier to discover, we should list them up top and link to them from other documentation pages as appropriate. was: There are a few pages tracked in `docs/` that are not linked to from the top navigation. Seems unintentional. To make these pages easier to discover, we should list them up top and link to them from other documentation pages as appropriate. > Add missing pages to documentation top navigation menu > -- > > Key: SPARK-30838 > URL: https://issues.apache.org/jira/browse/SPARK-30838 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Trivial > > There are a few pages tracked in `docs/` that are not linked to from the top > navigation. Seems unintentional. > To make these pages easier to discover, we should list them up top and link to > them from other documentation pages as appropriate.
[jira] [Created] (SPARK-30838) Add missing pages to documentation top navigation menu
Nicholas Chammas created SPARK-30838: Summary: Add missing pages to documentation top navigation menu Key: SPARK-30838 URL: https://issues.apache.org/jira/browse/SPARK-30838 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0 Reporter: Nicholas Chammas There are a few pages tracked in `docs/` that are not linked to from the top navigation. Seems unintentional.