[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459494#comment-17459494
 ] 

Nicholas Chammas commented on SPARK-25150:
--

I re-ran my test (described in the issue description + summarized in my comment 
just above) on Spark 3.2.0, and this issue appears to be resolved! Whether with 
cross joins enabled or disabled, I now get the correct results.

Of course, I have no idea which change since Spark 2.4.3 (the last time I ran 
this test) was responsible for the fix.

But to be clear, in case anyone wants to reproduce my test:
 # Download all 6 files attached to this issue into a folder.
 # Then, from within that folder, run {{spark-submit zombie-analysis.py}} and 
inspect the output.
 # Then, enable cross joins (the setting is commented out on line 9 of the script; 
see the sketch after this list), rerun the script, and reinspect the output.
 # Compare the final bit of output from both runs against 
{{{}expected-output.txt{}}}.
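
For reference, a sketch of the config toggle involved in step 3 (presumably what line 9 
of the script controls; illustrative only, not the attached script itself):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Disabled (the Spark 2.x default): joins whose condition Spark deems trivial
# fail with the "Join condition is missing or trivial" error.
spark.conf.set("spark.sql.crossJoin.enabled", "false")

# Enabled: the same joins run as cartesian products instead of failing.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
{code}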

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: Nicholas Chammas
>Priority: Major
>  Labels: correctness
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be a left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459467#comment-17459467
 ] 

Nicholas Chammas commented on SPARK-24853:
--

[~hyukjin.kwon] - Are you still opposed to this proposed improvement? If not, 
I'd like to work on it.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded versions of withColumn and withColumnRenamed that accept a 
> Column instead of a String? That way I can specify a fully qualified name when 
> there are duplicate column names, e.g. if I have two columns with the same name 
> as a result of a join and I want to rename one of them, I can do it with this 
> new API.
>  
> This would be similar to the drop API, which supports both String and Column types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  
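
For illustration only (not part of the original report), a minimal sketch of the ambiguity 
in question, assuming two hypothetical DataFrames {{left_df}} and {{right_df}} that each 
carry a {{count}} column; the Column-accepting call at the end is the proposed API and 
does not exist yet:
{code:python}
from pyspark.sql.functions import col

joined = left_df.alias("l").join(right_df.alias("r"), on="join_key")

# Today: withColumnRenamed only accepts a string, so "count" is ambiguous and
# the rename has to be expressed as a select with aliases instead.
disambiguated = joined.select(
    col("join_key"),
    col("l.count").alias("left_count"),
    col("r.count").alias("right_count"),
)

# Proposed (does not exist yet): pass a Column to pinpoint which "count" to rename.
# joined.withColumnRenamed(left_df["count"], "left_count")
{code}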



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26589) proper `median` method for spark dataframe

2021-12-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-26589.
--
Resolution: Won't Fix

Marking this as "Won't Fix", but I suppose if someone really wanted to, they 
could reopen this issue and propose adding a median function that is simply an 
alias for {{{}percentile(col, 0.5){}}}. Don't know how the committers would 
feel about that.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as 
> duplicates of it. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a proper, 
> exact (not approximate) median function. I am aware of the difficulties that a 
> distributed environment creates when trying to compute a median; nevertheless, I 
> don't think those difficulties are a good enough reason to drop the `median` 
> function from the scope of Spark. I am not asking for an efficient median, but 
> an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459455#comment-17459455
 ] 

Nicholas Chammas commented on SPARK-26589:
--

It looks like making a distributed, memory-efficient implementation of median 
is not possible using the design of Catalyst as it stands today. For more 
details, please see [this thread on the dev 
list|http://mail-archives.apache.org/mod_mbox/spark-dev/202112.mbox/%3cCAOhmDzev8d4H20XT1hUP9us=cpjeysgcf+xev7lg7dka1gj...@mail.gmail.com%3e].

It's possible to get an exact median today by using {{{}percentile(col, 
0.5){}}}, which is [available via the SQL 
API|https://spark.apache.org/docs/3.2.0/sql-ref-functions-builtin.html#aggregate-functions].
 It's not memory-efficient, so it may not work well on large datasets.

The Python and Scala DataFrame APIs do not offer this exact percentile 
function, so I've filed SPARK-37647 to track exposing this function in those 
APIs.
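
For reference, a minimal sketch of that workaround (table and column names are hypothetical):
{code:python}
from pyspark.sql.functions import percentile_approx

# Exact median of a hypothetical "price" column in a hypothetical "sales" table,
# via the SQL API. Exact percentile buffers values, so it can be memory-hungry.
exact_median = spark.sql(
    "SELECT percentile(price, 0.5) AS median FROM sales"
).first()["median"]

# The approximate alternative that the DataFrame API does expose (Spark 3.1+):
approx_median = spark.table("sales").agg(
    percentile_approx("price", 0.5).alias("approx_median")
).first()["approx_median"]
{code}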

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as 
> duplicates of it. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a proper, 
> exact (not approximate) median function. I am aware of the difficulties that a 
> distributed environment creates when trying to compute a median; nevertheless, I 
> don't think those difficulties are a good enough reason to drop the `median` 
> function from the scope of Spark. I am not asking for an efficient median, but 
> an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37647) Expose percentile function in Scala/Python APIs

2021-12-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37647:


 Summary: Expose percentile function in Scala/Python APIs
 Key: SPARK-37647
 URL: https://issues.apache.org/jira/browse/SPARK-37647
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


SQL offers a percentile function (exact, not approximate) that is not available 
directly in the Scala or Python DataFrame APIs.

While it is possible to invoke SQL functions from Scala or Python via 
{{{}expr(){}}}, I think most users expect function parity across Scala, Python, 
and SQL. 
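
A minimal sketch of the {{expr()}} workaround, assuming a hypothetical DataFrame {{df}} 
with a numeric {{value}} column:
{code:python}
from pyspark.sql.functions import expr

# Works today, but only by dropping down to a SQL expression string.
df.agg(expr("percentile(value, 0.5)").alias("median")).show()

# Parity would mean a first-class function, e.g. something like
# F.percentile("value", 0.5), which does not exist in these APIs yet.
{code}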



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-09 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456860#comment-17456860
 ] 

Nicholas Chammas commented on SPARK-26589:
--

That makes sense to me. I've been struggling with how to approach the 
implementation, so I've posted to the dev list asking for a little more help.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as 
> duplicates of it. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a proper, 
> exact (not approximate) median function. I am aware of the difficulties that a 
> distributed environment creates when trying to compute a median; nevertheless, I 
> don't think those difficulties are a good enough reason to drop the `median` 
> function from the scope of Spark. I am not asking for an efficient median, but 
> an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452081#comment-17452081
 ] 

Nicholas Chammas commented on SPARK-26589:
--

[~srowen] - I'll ask for help on the dev list if appropriate, but I'm wondering 
if you can give me some high level guidance here.

I have an outline of an approach to calculate the median that does not require 
sorting or shuffling the data. It's based on the approach I linked to in my 
previous comment (by Michael Harris). It does require, however, multiple passes 
over the data for the algorithm to converge on the median.

Here's a working sketch of the approach:
{code:python}
from pyspark.sql.functions import col


def spark_median(data):
    # Assumes a DataFrame with a single numeric column named "id".
    total_count = data.count()
    if total_count % 2 == 0:
        target_positions = [total_count // 2, total_count // 2 + 1]
    else:
        target_positions = [total_count // 2 + 1]
    target_values = [
        kth_position(data, k, data_count=total_count)
        for k in target_positions
    ]
    return sum(target_values) / len(target_values)


def kth_position(data, k, data_count=None):
    if data_count is None:
        total_count = data.count()
    else:
        total_count = data_count
    if k > total_count or k < 1:
        return None
    while True:
        # This value, along with the following two counts, are the only data that
        # need to be shared across nodes.
        some_value = data.first()["id"]
        # These two counts can be performed together via an aggregator.
        larger_count = data.where(col("id") > some_value).count()
        equal_count = data.where(col("id") == some_value).count()
        value_positions = range(
            total_count - larger_count - equal_count + 1,
            total_count - larger_count + 1,
        )
        # print(some_value, total_count, k, value_positions)
        if k in value_positions:
            return some_value
        elif k >= value_positions.stop:
            k -= (value_positions.stop - 1)
            data = data.where(col("id") > some_value)
            total_count = larger_count
        elif k < value_positions.start:
            data = data.where(col("id") < some_value)
            total_count -= (larger_count + equal_count)
{code}
Of course, this needs to be converted into a Catalyst Expression, but the basic 
idea is expressed there.
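
A quick usage sketch of the helpers above (assuming a SparkSession named {{spark}} and, 
as in the sketch, a single numeric column named {{id}}):
{code:python}
df = spark.range(1, 8)          # values 1..7 in a LongType column named "id"
print(spark_median(df))         # 7 rows, so the exact median is 4 -> prints 4.0
print(kth_position(df, 2))      # second-smallest value -> prints 2
{code}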

I am looking at the definitions of 
[DeclarativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L381-L394]
 and 
[ImperativeAggregate|https://github.com/apache/spark/blob/1dd0ca23f64acfc7a3dc697e19627a1b74012a2d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L267-L285]
 and trying to find an existing expression to model after, but I don't think we 
have any existing aggregates that would work like this median—specifically, 
where multiple passes over the data are required (in this case, to count 
elements matching different filters).

Do you have any advice on how to approach converting this into a Catalyst 
expression?

There is an 
[NthValue|https://github.com/apache/spark/blob/568ad6aa4435ce76ca3b5d9966e64259ea1f9b38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L648-L675]
 window expression, but I don't think I can build on it to make my median 
expression since a) median shouldn't be limited to window expressions, and b) 
NthValue requires a complete sort of the data, which I want to avoid.

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as 
> duplicates of it. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a proper, 
> exact (not approximate) median function. I am aware of the difficulties that a 
> distributed environment creates when trying to compute a median; nevertheless, I 
> don't think those difficulties are a good enough reason to drop the `median` 
> function from the scope of Spark. I am not asking for an efficient median, but 
> an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-12-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451936#comment-17451936
 ] 

Nicholas Chammas commented on SPARK-26589:
--

Just for reference, Stack Overflow provides evidence that a proper median 
function has been in high demand for some time:
 * [How can I calculate exact median with Apache 
Spark?|https://stackoverflow.com/q/28158729/877069] (14K views)
 * [How to find median and quantiles using 
Spark|https://stackoverflow.com/q/31432843/877069] (117K views)
 * [Median / quantiles within PySpark 
groupBy|https://stackoverflow.com/q/46845672/877069] (67K views)

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as 
> duplicates of it. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a proper, 
> exact (not approximate) median function. I am aware of the difficulties that a 
> distributed environment creates when trying to compute a median; nevertheless, I 
> don't think those difficulties are a good enough reason to drop the `median` 
> function from the scope of Spark. I am not asking for an efficient median, but 
> an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26589) proper `median` method for spark dataframe

2021-11-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451283#comment-17451283
 ] 

Nicholas Chammas edited comment on SPARK-26589 at 11/30/21, 6:17 PM:
-

I think there is a potential solution using the algorithm [described here by 
Michael 
Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers].


was (Author: nchammas):
I'm going to try to implement this using the algorithm [described here by 
Michael 
Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers].

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as 
> duplicates of it. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a proper, 
> exact (not approximate) median function. I am aware of the difficulties that a 
> distributed environment creates when trying to compute a median; nevertheless, I 
> don't think those difficulties are a good enough reason to drop the `median` 
> function from the scope of Spark. I am not asking for an efficient median, but 
> an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26589) proper `median` method for spark dataframe

2021-11-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17451283#comment-17451283
 ] 

Nicholas Chammas commented on SPARK-26589:
--

I'm going to try to implement this using the algorithm [described here by 
Michael 
Harris|https://www.quora.com/Distributed-Algorithms/What-is-the-distributed-algorithm-to-determine-the-median-of-arrays-of-integers-located-on-different-computers].

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as 
> duplicates of it. The thing is that approximate quantile is a workaround for the 
> lack of a median function. Thus I am filing this feature request for a proper, 
> exact (not approximate) median function. I am aware of the difficulties that a 
> distributed environment creates when trying to compute a median; nevertheless, I 
> don't think those difficulties are a good enough reason to drop the `median` 
> function from the scope of Spark. I am not asking for an efficient median, but 
> an exact median.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12185) Add Histogram support to Spark SQL/DataFrames

2021-11-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-12185:
-
Labels:   (was: bulk-closed)

> Add Histogram support to Spark SQL/DataFrames
> -
>
> Key: SPARK-12185
> URL: https://issues.apache.org/jira/browse/SPARK-12185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Holden Karau
>Priority: Minor
>
> While we have the ability to compute histograms on RDDs of Doubles, it would 
> be good to also directly support histograms in Spark SQL (see 
> https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions
>  ).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12185) Add Histogram support to Spark SQL/DataFrames

2021-11-30 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-12185:
--

Reopening this because I think it's a valid improvement that mirrors the 
existing {{RDD.histogram}} method.
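
For context, a minimal sketch of the existing RDD-level API that a DataFrame/SQL 
histogram would mirror:
{code:python}
rdd = spark.sparkContext.parallelize([1.0, 2.5, 3.0, 7.5, 9.0])

# Two evenly spaced buckets over [min, max]; returns (bucket boundaries, counts).
buckets, counts = rdd.histogram(2)
print(buckets)  # [1.0, 5.0, 9.0]
print(counts)   # [3, 2]
{code}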

> Add Histogram support to Spark SQL/DataFrames
> -
>
> Key: SPARK-12185
> URL: https://issues.apache.org/jira/browse/SPARK-12185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Holden Karau
>Priority: Minor
>  Labels: bulk-closed
>
> While we have the ability to compute histograms on RDDs of Doubles, it would 
> be good to also directly support histograms in Spark SQL (see 
> https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining#StatisticsAndDataMining-histogram_numeric():Estimatingfrequencydistributions
>  ).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37393) Inline annotations for {ml, mllib}/common.py

2021-11-22 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-37393.
--
Resolution: Duplicate

> Inline annotations for {ml, mllib}/common.py
> 
>
> Key: SPARK-37393
> URL: https://issues.apache.org/jira/browse/SPARK-37393
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> This will allow us to run type checks against those files themselves.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37393) Inline annotations for {ml, mllib}/common.py

2021-11-19 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37393:
-
Description: This will allow us to run type checks against those files 
themselves.

> Inline annotations for {ml, mllib}/common.py
> 
>
> Key: SPARK-37393
> URL: https://issues.apache.org/jira/browse/SPARK-37393
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> This will allow us to run type checks against those files themselves.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37393) Inline annotations for {ml, mllib}/common.py

2021-11-19 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37393:


 Summary: Inline annotations for {ml, mllib}/common.py
 Key: SPARK-37393
 URL: https://issues.apache.org/jira/browse/SPARK-37393
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, PySpark
Affects Versions: 3.2.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37380) Miscellaneous Python lint infra cleanup

2021-11-18 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37380:


 Summary: Miscellaneous Python lint infra cleanup
 Key: SPARK-37380
 URL: https://issues.apache.org/jira/browse/SPARK-37380
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra, PySpark
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


* pylintrc is obsolete
 * tox.ini should be reformatted for easy reading/updating
 * requirements used on CI should be listed in requirements.txt



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37336) Migrate _java2py to SparkSession

2021-11-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37336:
-
Summary: Migrate _java2py to SparkSession  (was: Migrate common ML utils to 
SparkSession)

> Migrate _java2py to SparkSession
> 
>
> Key: SPARK-37336
> URL: https://issues.apache.org/jira/browse/SPARK-37336
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {{_java2py()}} uses a deprecated method to create a SparkSession.
>  
> https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99
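
A rough sketch of the direction such a migration could take (the exact deprecated call 
lives at the linked line; this helper is hypothetical):
{code:python}
from pyspark.sql import SparkSession

def _get_session() -> SparkSession:
    # Hypothetical helper: reuse the active session if one exists, otherwise
    # build one, rather than constructing a session via deprecated calls.
    return SparkSession.getActiveSession() or SparkSession.builder.getOrCreate()
{code}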



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37336) Migrate common ML utils to SparkSession

2021-11-15 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37336:


 Summary: Migrate common ML utils to SparkSession
 Key: SPARK-37336
 URL: https://issues.apache.org/jira/browse/SPARK-37336
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


{{_java2py()}} uses a deprecated method to create a SparkSession.
 
https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37335:
-
Description: 
The association rules returned by FPGrowth include more columns than are 
documented, like {{{}lift{}}}:

[https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]

We should offer a basic description of these columns. An _itemset_ should also 
be briefly defined.

  was:
The association rules returned by FPGrowth include more columns than are 
documented:

[https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]

We should offer a basic description of these columns.


> Clarify output of FPGrowth
> --
>
> Key: SPARK-37335
> URL: https://issues.apache.org/jira/browse/SPARK-37335
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The association rules returned by FPGrowth include more columns than are 
> documented, like {{{}lift{}}}:
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> We should offer a basic description of these columns. An _itemset_ should 
> also be briefly defined.
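
For reference, a minimal sketch that surfaces the columns in question (the extra columns 
such as lift and support appear in recent releases; exact output may vary by version):
{code:python}
from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame(
    [(0, ["a", "b"]), (1, ["a", "b", "c"]), (2, ["a"])],
    ["id", "items"],
)
model = FPGrowth(itemsCol="items", minSupport=0.3, minConfidence=0.5).fit(df)

# The documented columns are antecedent, consequent, and confidence;
# lift (and, in newer releases, support) also show up in the output.
model.associationRules.printSchema()
{code}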



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37335:


 Summary: Clarify output of FPGrowth
 Key: SPARK-37335
 URL: https://issues.apache.org/jira/browse/SPARK-37335
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


The association rules returned by FPGrowth include more columns than are 
documented:

[https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]

We should offer a basic description of these columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394
 ] 

Nicholas Chammas edited comment on SPARK-24853 at 11/2/21, 2:41 PM:


[~hyukjin.kwon] - It's not just for consistency. As noted in the description, 
this is useful when you are trying to rename a column with an ambiguous name.

For example, imagine two tables {{left}} and {{right}}, each with a column 
called {{count}}:
{code:python}
(
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .withColumn(
    'total_count',
    left_counts['count'] + right_counts['count']
  )
  .withColumnRenamed('left.count', 'left_count')  # no-op; alias doesn't work
  .withColumnRenamed('count', 'left_count')  # incorrect; it renames both count columns
  .withColumnRenamed(left_counts['count'], 'left_count')  # what, ideally, users want to do here
  .show()
){code}
If you don't mind, I'm going to reopen this issue.


was (Author: nchammas):
[~hyukjin.kwon] - It's not just for consistency. As noted in the description, 
this is useful when you are trying to rename a column with an ambiguous name.

For example, imagine two tables {{left}} and {{right}}, each with a column 
called {{count}}:
{code:java}
(
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .withColumn(
    'total_count',
    left_counts['count'] + right_counts['count']
  )
  .withColumnRenamed('left.count', 'left_count')  # no-op; alias doesn't work
  .withColumnRenamed('count', 'left_count')  # incorrect; it renames both count columns
  .withColumnRenamed(left_counts['count'], 'left_count')  # what, ideally, users want to do here
  .show()
){code}
If you don't mind, I'm going to reopen this issue.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded versions of withColumn and withColumnRenamed that accept a 
> Column instead of a String? That way I can specify a fully qualified name when 
> there are duplicate column names, e.g. if I have two columns with the same name 
> as a result of a join and I want to rename one of them, I can do it with this 
> new API.
>  
> This would be similar to the drop API, which supports both String and Column types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437396#comment-17437396
 ] 

Nicholas Chammas commented on SPARK-24853:
--

The [contributing guide|https://spark.apache.org/contributing.html] isn't clear 
on how to populate "Affects Version" for improvements, so I've just tagged the 
latest release.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded versions of withColumn and withColumnRenamed that accept a 
> Column instead of a String? That way I can specify a fully qualified name when 
> there are duplicate column names, e.g. if I have two columns with the same name 
> as a result of a join and I want to rename one of them, I can do it with this 
> new API.
>  
> This would be similar to the drop API, which supports both String and Column types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-24853:
-
Priority: Minor  (was: Major)

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded versions of withColumn and withColumnRenamed that accept a 
> Column instead of a String? That way I can specify a fully qualified name when 
> there are duplicate column names, e.g. if I have two columns with the same name 
> as a result of a join and I want to rename one of them, I can do it with this 
> new API.
>  
> This would be similar to the drop API, which supports both String and Column types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-24853:
--

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded versions of withColumn and withColumnRenamed that accept a 
> Column instead of a String? That way I can specify a fully qualified name when 
> there are duplicate column names, e.g. if I have two columns with the same name 
> as a result of a join and I want to rename one of them, I can do it with this 
> new API.
>  
> This would be similar to the drop API, which supports both String and Column types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394
 ] 

Nicholas Chammas commented on SPARK-24853:
--

[~hyukjin.kwon] - It's not just for consistency. As noted in the description, 
this is useful when you are trying to rename a column with an ambiguous name.

For example, imagine two tables {{left}} and {{right}}, each with a column 
called {{count}}:
{code:java}
(
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .withColumn(
    'total_count',
    left_counts['count'] + right_counts['count']
  )
  .withColumnRenamed('left.count', 'left_count')  # no-op; alias doesn't work
  .withColumnRenamed('count', 'left_count')  # incorrect; it renames both count columns
  .withColumnRenamed(left_counts['count'], 'left_count')  # what, ideally, users want to do here
  .show()
){code}
If you don't mind, I'm going to reopen this issue.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2
>Reporter: nirav patel
>Priority: Major
>
> Can we add overloaded versions of withColumn and withColumnRenamed that accept a 
> Column instead of a String? That way I can specify a fully qualified name when 
> there are duplicate column names, e.g. if I have two columns with the same name 
> as a result of a join and I want to rename one of them, I can do it with this 
> new API.
>  
> This would be similar to the drop API, which supports both String and Column types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-24853:
-
Affects Version/s: 3.2.0

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add overloaded versions of withColumn and withColumnRenamed that accept a 
> Column instead of a String? That way I can specify a fully qualified name when 
> there are duplicate column names, e.g. if I have two columns with the same name 
> as a result of a join and I want to rename one of them, I can do it with this 
> new API.
>  
> This would be similar to the drop API, which supports both String and Column types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-04-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322333#comment-17322333
 ] 

Nicholas Chammas commented on SPARK-33000:
--

Per the discussion [on the dev 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Shutdown-cleanup-of-disk-based-resources-that-Spark-creates-td30928.html]
 and [PR|https://github.com/apache/spark/pull/31742], it seems we just want to 
update the documentation to clarify that {{cleanCheckpoints}} does not impact 
shutdown behavior. i.e. Checkpoints are not meant to be cleaned up on shutdown 
(whether planned or unplanned), and the config is currently working as intended.
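
Given that, a sketch of how a job can clean up its own checkpoint directory explicitly 
(paths and job logic are placeholders):
{code:python}
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
checkpoint_dir = "/tmp/spark/checkpoint/"   # placeholder path, as in the repro above
spark.sparkContext.setCheckpointDir(checkpoint_dir)

try:
    df = spark.range(10).checkpoint()
    # ... rest of the job ...
finally:
    spark.stop()
    # cleanCheckpoints does not remove these files at shutdown, so do it explicitly.
    # (Works for a local checkpoint dir; use the matching filesystem client otherwise.)
    shutil.rmtree(checkpoint_dir, ignore_errors=True)
{code}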

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint]
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
> Evidence the current behavior is confusing:
>  * [https://stackoverflow.com/q/52630858/877069]
>  * [https://stackoverflow.com/q/60009856/877069]
>  * [https://stackoverflow.com/q/61454740/877069]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33436) PySpark equivalent of SparkContext.hadoopConfiguration

2021-03-10 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299283#comment-17299283
 ] 

Nicholas Chammas commented on SPARK-33436:
--

[~hyukjin.kwon] - Can you clarify please why this ticket is "Won't Fix"? Just 
so it's clear for others who come across this ticket.

Is `._jsc` the intended way for PySpark users to set S3A configs?

> PySpark equivalent of SparkContext.hadoopConfiguration
> --
>
> Key: SPARK-33436
> URL: https://issues.apache.org/jira/browse/SPARK-33436
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PySpark should offer an API to {{hadoopConfiguration}} to [match 
> Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration].
> Setting Hadoop configs within a job is handy for any configurations that are 
> not appropriate as cluster defaults, or that will not be known until run 
> time. The various {{fs.s3a.*}} configs are a good example of this.
> Currently, what people are doing is setting things like this [via 
> SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].
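
For reference, a sketch of both patterns (config keys are the usual fs.s3a.* examples; 
values are placeholders):
{code:python}
from pyspark.sql import SparkSession

# Option 1: set Hadoop configs up front via spark.hadoop.* when building the session.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.access.key", "PLACEHOLDER")
    .config("spark.hadoop.fs.s3a.secret.key", "PLACEHOLDER")
    .getOrCreate()
)

# Option 2: the workaround discussed above, reaching through the private _jsc handle
# at run time, since PySpark has no public hadoopConfiguration() accessor.
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")
{code}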



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-03-10 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-33000:
-
Description: 
Maybe it's just that the documentation needs to be updated, but I found this 
surprising:
{code:python}
$ pyspark
...
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]   
>>> exit(){code}
The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
Spark to clean it up on shutdown.

The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:
{quote}Controls whether to clean checkpoint files if the reference is out of 
scope.
{quote}
When Spark shuts down, everything goes out of scope, so I'd expect all 
checkpointed RDDs to be cleaned up.

For the record, I see the same behavior in both the Scala and Python REPLs.

Evidence the current behavior is confusing:
 * [https://stackoverflow.com/q/52630858/877069]
 * [https://stackoverflow.com/q/60009856/877069]
 * [https://stackoverflow.com/q/61454740/877069]

 

  was:
Maybe it's just that the documentation needs to be updated, but I found this 
surprising:
{code:python}
$ pyspark
...
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]   
>>> exit(){code}
The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
Spark to clean it up on shutdown.

The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:
{quote}Controls whether to clean checkpoint files if the reference is out of 
scope.
{quote}
When Spark shuts down, everything goes out of scope, so I'd expect all 
checkpointed RDDs to be cleaned up.

For the record, I see the same behavior in both the Scala and Python REPLs.


> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint]
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
> Evidence the current behavior is confusing:
>  * [https://stackoverflow.com/q/52630858/877069]
>  * [https://stackoverflow.com/q/60009856/877069]
>  * [https://stackoverflow.com/q/61454740/877069]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-03-04 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295531#comment-17295531
 ] 

Nicholas Chammas commented on SPARK-33000:
--

[~caowang888] - If you're still interested in this issue, take a look at my PR 
and let me know what you think. Hopefully, I've understood the issue correctly 
and proposed an appropriate fix.

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint]
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-08 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-34194.
--
Resolution: Won't Fix

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas listing the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.
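
For reference, a sketch of the catalog-based workaround mentioned above, for the case 
where the data is registered as a partitioned table (names taken from the example above; 
adapt as needed):
{code:python}
# Instead of data.select('file_date').orderBy(...).limit(1), which touches file-level
# metadata, ask the metastore for the partitions directly (no data files are scanned).
partitions = spark.sql("SHOW PARTITIONS some_dataset")

latest_file_date = (
    partitions
    .selectExpr("split(partition, '=')[1] AS file_date")
    .orderBy("file_date", ascending=False)
    .first()["file_date"]
)
{code}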



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-08 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281269#comment-17281269
 ] 

Nicholas Chammas commented on SPARK-34194:
--

It's not clear to me whether SPARK-26709 is describing an inherent design issue 
that has no fix, or whether SPARK-26709 simply captures a bug in the past 
implementation of {{OptimizeMetadataOnlyQuery}} which could conceivably be 
fixed in the future.

If it's something that could be fixed and reintroduced, this issue should stay 
open. If we know for design reasons that metadata-only queries cannot be made 
reliably correct, then this issue should be closed with a clear explanation to 
that effect.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276869#comment-17276869
 ] 

Nicholas Chammas edited comment on SPARK-34194 at 2/2/21, 5:56 AM:
---

Interesting reference, [~attilapiros]. It looks like that config is internal to 
Spark and was [deprecated in Spark 
3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929]
 due to the correctness issue mentioned in that warning and documented in 
SPARK-26709.


was (Author: nchammas):
Interesting reference, [~attilapiros]. It looks like that config was 
[deprecated in Spark 
3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929]
 due to the correctness issue mentioned in that warning and documented in 
SPARK-26709.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276869#comment-17276869
 ] 

Nicholas Chammas commented on SPARK-34194:
--

Interesting reference, [~attilapiros]. It looks like that config was 
[deprecated in Spark 
3.0|https://github.com/apache/spark/blob/bec80d7eec91ee83fcbb0e022b33bd526c80f423/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L918-L929]
 due to the correctness issue mentioned in that warning and documented in 
SPARK-26709.
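
For anyone trying this themselves: a minimal sketch of how that deprecated flag 
would be toggled. The config name spark.sql.optimizer.metadataOnly is an 
assumption based on the linked SQLConf lines, and enabling it risks the 
SPARK-26709 correctness issue, so this is illustration only.
{code:python}
# Assumed config name (inferred from the linked SQLConf lines); deprecated in
# Spark 3.0. Enabling it can return wrong results when partition directories
# contain no data files (SPARK-26709), so treat this as illustration only.
# `some_table` is a placeholder for a partitioned table in the metastore.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.optimizer.metadataOnly", "true")
    .getOrCreate()
)
spark.sql("SELECT MAX(file_date) FROM some_table").explain()
{code}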

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI takes a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17269571#comment-17269571
 ] 

Nicholas Chammas commented on SPARK-12890:
--

I've created SPARK-34194 and fleshed out the description of the problem a bit.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-01-21 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-34194:


 Summary: Queries that only touch partition columns shouldn't scan 
through all files
 Key: SPARK-34194
 URL: https://issues.apache.org/jira/browse/SPARK-34194
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


When querying only the partition columns of a partitioned table, it seems that 
Spark nonetheless scans through all files in the table, even though it doesn't 
need to.

Here's an example:
{code:python}
>>> data = spark.read.option('mergeSchema', 
>>> 'false').parquet('s3a://some/dataset')
[Stage 0:==>  (407 + 12) / 1158]
{code}
Note the 1158 tasks. This matches the number of partitions in the table, which 
is partitioned on a single field named {{file_date}}:
{code:sh}
$ aws s3 ls s3://some/dataset | head -n 3
   PRE file_date=2017-05-01/
   PRE file_date=2017-05-02/
   PRE file_date=2017-05-03/

$ aws s3 ls s3://some/dataset | wc -l
1158
{code}
The table itself has over 138K files, though:
{code:sh}
$ aws s3 ls --recursive --human --summarize s3://some/dataset
...
Total Objects: 138708
   Total Size: 3.7 TiB
{code}
Now let's try to query just the {{file_date}} field and see what Spark does.
{code:python}
>>> data.select('file_date').orderBy('file_date', 
>>> ascending=False).limit(1).explain()
== Physical Plan ==
TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
output=[file_date#11])
+- *(1) ColumnarToRow
   +- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<>

>>> data.select('file_date').orderBy('file_date', 
>>> ascending=False).limit(1).show()
[Stage 2:>   (179 + 12) / 41011]
{code}
Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
job progresses? I'm not sure.

What I do know is that this operation takes a long time (~20 min) running from 
my laptop, whereas to list the top-level {{file_date}} partitions via the AWS 
CLI takes a second or two.

Spark appears to be going through all the files in the table, when it just 
needs to list the partitions captured in the S3 "directory" structure. The 
query is only touching {{file_date}}, after all.

The current workaround for this performance problem / optimizer wastefulness
is to [query the catalog directly|https://stackoverflow.com/a/65724151/877069]. 
It works, but is a lot of extra work compared to the elegant query against 
{{file_date}} that users actually intend.
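
For illustration, a minimal sketch of that catalog-based workaround, assuming 
the dataset is registered in the metastore as a partitioned table named 
some_table (the linked answer may differ in its details):
{code:python}
# Read partition values from the metastore instead of scanning data files.
# `some_table` is a placeholder; SHOW PARTITIONS only works for tables
# registered in the catalog, not for bare paths read via spark.read.parquet.
latest = (
    spark.sql("SHOW PARTITIONS some_table")
    .selectExpr("split(partition, '=')[1] AS file_date")
    .orderBy("file_date", ascending=False)
    .limit(1)
)
latest.show()
{code}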

Spark should somehow know when it is only querying partition fields and skip 
iterating through all the individual files in a table.

Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-18 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267657#comment-17267657
 ] 

Nicholas Chammas commented on SPARK-12890:
--

Sure, will do.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2021-01-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253044#comment-17253044
 ] 

Nicholas Chammas edited comment on SPARK-12890 at 1/14/21, 5:41 PM:


I think this is still an open issue. On Spark 2.4.6:
{code:java}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).explain()
== Physical Plan == 
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], 
output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:java}
$ aws s3 ls s3://some/dataset/
   PRE file_date=2018-10-02/
   PRE file_date=2018-10-08/
   PRE file_date=2018-10-15/ 
   ...{code}
Schema merging is not enabled, as far as I can tell.

Shouldn't Spark be able to answer this query without going through ~20K files?

Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when 
it's only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog 
to list partitions and extract what's needed that 
way|https://stackoverflow.com/a/65724151/877069]. But it seems like Spark 
should handle this situation better.


was (Author: nchammas):
I think this is still an open issue. On Spark 2.4.6:
{code:java}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).explain()
== Physical Plan == 
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], 
output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:java}
$ aws s3 ls s3://some/dataset/
   PRE file_date=2018-10-02/
   PRE file_date=2018-10-08/
   PRE file_date=2018-10-15/ 
   ...{code}
Schema merging is not enabled, as far as I can tell.

Shouldn't Spark be able to answer this query without going through ~20K files?

Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when 
it's only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog 
to list partitions and extract what's needed that 
way|https://stackoverflow.com/a/57440760/877069]. But it seems like Spark 
should handle this situation better.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2020-12-21 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-12890:
--

Reopening because I think there is a valid potential improvement to be made 
here. If I've misunderstood, feel free to close again.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2020-12-21 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-12890:
-
Labels:   (was: bulk-closed)

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Major
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2020-12-21 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-12890:
-
Priority: Minor  (was: Major)

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Minor
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2020-12-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253044#comment-17253044
 ] 

Nicholas Chammas commented on SPARK-12890:
--

I think this is still an open issue. On Spark 2.4.6:
{code:java}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).explain()
== Physical Plan == 
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], 
output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, 
Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date',
>>>  ascending=False).limit(1).show()
[Stage 23:=>(2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:java}
$ aws s3 ls s3://some/dataset/
   PRE file_date=2018-10-02/
   PRE file_date=2018-10-08/
   PRE file_date=2018-10-15/ 
   ...{code}
Schema merging is not enabled, as far as I can tell.

Shouldn't Spark be able to answer this query without going through ~20K files?

Is the problem that {{FileScan}} somehow needs to be enhanced to recognize when 
it's only projecting partitioning columns?

For the record, the best current workaround appears to be to [use the catalog 
to list partitions and extract what's needed that 
way|https://stackoverflow.com/a/57440760/877069]. But it seems like Spark 
should handle this situation better.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>Priority: Major
>  Labels: bulk-closed
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33436) PySpark equivalent of SparkContext.hadoopConfiguration

2020-11-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-33436:
-
Description: 
PySpark should offer an API to {{hadoopConfiguration}} to [match 
Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration].

Setting Hadoop configs within a job is handy for any configurations that are 
not appropriate as cluster defaults, or that will not be known until run time. 
The various {{fs.s3a.*}} configs are a good example of this.

Currently, what people are doing is setting things like this [via 
SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].
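
For reference, a sketch of what that workaround looks like in practice (the 
config keys and values below are placeholders):
{code:python}
# Workaround via the internal Py4J handle `_jsc`, since PySpark exposes no
# public hadoopConfiguration() accessor. Being internal, it may change
# between releases.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "PLACEHOLDER_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "PLACEHOLDER_SECRET_KEY")
{code}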

  was:
PySpark should offer an API to {{hadoopConfiguration}} to [match 
Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration].

Setting Hadoop configs within a job is handy for any configurations that are 
not appropriate as global defaults, or that will not be known until run time. 
The various {{fs.s3a.*}} configs are a good example of this.

Currently, what people are doing is setting things like this [via 
SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].


> PySpark equivalent of SparkContext.hadoopConfiguration
> --
>
> Key: SPARK-33436
> URL: https://issues.apache.org/jira/browse/SPARK-33436
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PySpark should offer an API to {{hadoopConfiguration}} to [match 
> Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration].
> Setting Hadoop configs within a job is handy for any configurations that are 
> not appropriate as cluster defaults, or that will not be known until run 
> time. The various {{fs.s3a.*}} configs are a good example of this.
> Currently, what people are doing is setting things like this [via 
> SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33436) PySpark equivalent of SparkContext.hadoopConfiguration

2020-11-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-33436:
-
Description: 
PySpark should offer an API to {{hadoopConfiguration}} to [match 
Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration].

Setting Hadoop configs within a job is handy for any configurations that are 
not appropriate as global defaults, or that will not be known until run time. 
The various {{fs.s3a.*}} configs are a good example of this.

Currently, what people are doing is setting things like this [via 
SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].

  was:
PySpark should offer an API to {{hadoopConfiguration}} to [match 
Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration].

Setting Hadoop configs within a job is handy for any configurations that are 
not appropriate as global defaults, or that will not be known until run time. 
The various {{fs.s3a.*}} configs are a good example of this.


> PySpark equivalent of SparkContext.hadoopConfiguration
> --
>
> Key: SPARK-33436
> URL: https://issues.apache.org/jira/browse/SPARK-33436
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PySpark should offer an API to {{hadoopConfiguration}} to [match 
> Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration].
> Setting Hadoop configs within a job is handy for any configurations that are 
> not appropriate as global defaults, or that will not be known until run time. 
> The various {{fs.s3a.*}} configs are a good example of this.
> Currently, what people are doing is setting things like this [via 
> SparkContext._jsc.hadoopConfiguration()|https://stackoverflow.com/a/32661336/877069].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33436) PySpark equivalent of SparkContext.hadoopConfiguration

2020-11-12 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-33436:


 Summary: PySpark equivalent of SparkContext.hadoopConfiguration
 Key: SPARK-33436
 URL: https://issues.apache.org/jira/browse/SPARK-33436
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


PySpark should offer an API to {{hadoopConfiguration}} to [match 
Scala's|http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html#hadoopConfiguration:org.apache.hadoop.conf.Configuration].

Setting Hadoop configs within a job is handy for any configurations that are 
not appropriate as global defaults, or that will not be known until run time. 
The various {{fs.s3a.*}} configs are a good example of this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33434) Document spark.conf.isModifiable()

2020-11-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-33434:
-
Affects Version/s: (was: 2.4.7)
   (was: 3.0.1)
   3.1.0

> Document spark.conf.isModifiable()
> --
>
> Key: SPARK-33434
> URL: https://issues.apache.org/jira/browse/SPARK-33434
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears 
> to be a public method introduced in SPARK-24761.
> http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33434) Document spark.conf.isModifiable()

2020-11-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-33434:
-
Affects Version/s: 3.0.1

> Document spark.conf.isModifiable()
> --
>
> Key: SPARK-33434
> URL: https://issues.apache.org/jira/browse/SPARK-33434
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears 
> to be a public method introduced in SPARK-24761.
> http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33434) Document spark.conf.isModifiable()

2020-11-12 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-33434:
-
Description: 
PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears to 
be a public method introduced in SPARK-24761.

http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf

  was:PySpark's docs make no mention of {{conf.isModifiable()}}, though it 
appears to be a public method introduced in SPARK-24761.


> Document spark.conf.isModifiable()
> --
>
> Key: SPARK-33434
> URL: https://issues.apache.org/jira/browse/SPARK-33434
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 2.4.7
>Reporter: Nicholas Chammas
>Priority: Minor
>
> PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears 
> to be a public method introduced in SPARK-24761.
> http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33434) Document spark.conf.isModifiable()

2020-11-12 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-33434:


 Summary: Document spark.conf.isModifiable()
 Key: SPARK-33434
 URL: https://issues.apache.org/jira/browse/SPARK-33434
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 2.4.7
Reporter: Nicholas Chammas


PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears to 
be a public method introduced in SPARK-24761.
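
For context, a small example of the method in question; the True/False values 
shown assume stock Spark defaults:
{code:python}
# isModifiable() reports whether a config key can be changed on a running
# session. Runtime SQL configs return True, static configs return False.
spark.conf.isModifiable("spark.sql.shuffle.partitions")  # True
spark.conf.isModifiable("spark.sql.warehouse.dir")       # False
{code}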



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26764) [SPIP] Spark Relational Cache

2020-10-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218374#comment-17218374
 ] 

Nicholas Chammas commented on SPARK-26764:
--

The SPIP PDF references a design doc, but I'm not clear on where the design doc 
actually is. Is this issue supposed to be linked to some other ones?

Also, appendix B suggests to me that this idea would mesh well with the 
existing proposals to support materialized views. I could actually see this as 
an enhancement to those proposals, like SPARK-29038.

In fact, when I look at the [design 
doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit#]
 for SPARK-29038, I see that goal 3 covers automatic query rewrites, which I 
think subsumes the main benefit of this proposal as compared to "traditional" 
materialized views.
{quote}> 3. A query _rewrite_ capability to transparently rewrite a query to 
use a materialized view[1][2].
 > a. Query rewrite capability is transparent to SQL applications.
 > b. Query rewrite can be disabled at the system level or on individual 
 > materialized view. Also it can be disabled for a specified query via hint.
 > c. Query rewrite as a rule in optimizer should be made sure that it won’t 
 > cause performance regression if it can use other index or cache.
{quote}

> [SPIP] Spark Relational Cache
> -
>
> Key: SPARK-26764
> URL: https://issues.apache.org/jira/browse/SPARK-26764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Adrian Wang
>Priority: Major
> Attachments: Relational+Cache+SPIP.pdf
>
>
> In modern database systems, relational cache is a common technology to boost 
> ad-hoc queries. While Spark provides cache natively, Spark SQL should be able 
> to utilize the relationship between relations to boost all possible queries. 
> In this SPIP, we will make Spark be able to utilize all defined cached 
> relations if possible, without explicit substitution in user query, as well 
> as keep some user defined cache available in different sessions. Materialized 
> views in many database systems provide similar function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-16 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215577#comment-17215577
 ] 

Nicholas Chammas commented on SPARK-33000:
--

Ctrl-D gracefully shuts down the Python REPL, so that should trigger the 
appropriate cleanup.

I repeated my test and did {{spark.stop()}} instead of Ctrl-D and waited 2 
minutes. Same result. No cleanup.

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint] 
>   
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-16 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215469#comment-17215469
 ] 

Nicholas Chammas commented on SPARK-33000:
--

I've tested this out a bit more, and I think the original issue I reported is 
valid. Either that, or I'm still missing something.

I built Spark at the latest commit from {{master}}:
{code:java}
commit 3ae1520185e2d96d1bdbd08c989f0d48ad3ba578 (HEAD -> master, origin/master, 
origin/HEAD, apache/master)
Author: ulysses 
Date:   Fri Oct 16 11:26:27 2020 + {code}
One thing that has changed is that Spark now prevents you from setting 
{{cleanCheckpoints}} after startup:
{code:java}
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/sql/conf.py", line 36, in set
    self._jconf.set(key, value)
  File ".../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
1304, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Cannot modify the value of a Spark config: 
spark.cleaner.referenceTracking.cleanCheckpoints; {code}
So that's good! This makes it clear to the user that this setting cannot be set 
at this time (though it could be made more helpful if it explained why).

However, if I try to set the config as part of invoking PySpark, I still don't 
see any checkpointed data get cleaned up on shutdown:
{code:java}
$ rm -rf /tmp/spark/checkpoint/
$ ./bin/pyspark --conf spark.cleaner.referenceTracking.cleanCheckpoints=true

>>> spark.conf.get('spark.cleaner.referenceTracking.cleanCheckpoints')
'true'
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]                                                           
>>> 
$ du -sh /tmp/spark/checkpoint/*
32K /tmp/spark/checkpoint/57b0a413-9d47-4bcd-99ef-265e9f5c0f3b{code}
 

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint] 
>   
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215085#comment-17215085
 ] 

Nicholas Chammas commented on SPARK-33000:
--

Thanks for the explanation! I'm happy to leave this to you if you'd like to get 
back into open source work.

If you're not sure you'll get to it in the next couple of weeks, let me know 
and I can take care of this since it's just a documentation update.

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint] 
>   
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-10-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214904#comment-17214904
 ] 

Nicholas Chammas commented on SPARK-33000:
--

Thanks for the pointer!

No need for a new ticket. I can adjust the title of this ticket and open a PR 
for the doc update.

Separate question: Do you know why there is such a restriction on when this 
configuration needs to be set? It's surprising since you can set the checkpoint 
directory after job submission.

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint] 
>   
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33017) PySpark Context should have getCheckpointDir() method

2020-09-28 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-33017:


 Summary: PySpark Context should have getCheckpointDir() method
 Key: SPARK-33017
 URL: https://issues.apache.org/jira/browse/SPARK-33017
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


To match the Scala API, PySpark should offer a direct way to get the checkpoint 
dir.
{code:scala}
scala> spark.sparkContext.setCheckpointDir("/tmp/spark/checkpoint")
scala> spark.sparkContext.getCheckpointDir
res3: Option[String] = 
Some(file:/tmp/spark/checkpoint/34ebe699-bc83-4c5d-bfa2-50451296cf87)
{code}
Currently, the only way to do that from PySpark is via the underlying Java 
context:
{code:python}
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> sc._jsc.sc().getCheckpointDir().get()
'file:/tmp/spark/checkpoint/ebf0fab5-edbc-42c2-938f-65d5e599cf54'
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-09-28 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-33000:
-
Description: 
Maybe it's just that the documentation needs to be updated, but I found this 
surprising:
{code:python}
$ pyspark
...
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]   
>>> exit(){code}
The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
Spark to clean it up on shutdown.

The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:
{quote}Controls whether to clean checkpoint files if the reference is out of 
scope.
{quote}
When Spark shuts down, everything goes out of scope, so I'd expect all 
checkpointed RDDs to be cleaned up.

For the record, I see the same behavior in both the Scala and Python REPLs.

  was:
Maybe it's just that the documentation needs to be updated, but I found this 
surprising:
{code:java}
$ pyspark
...
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]   
>>> exit(){code}
The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
Spark to clean it up on shutdown.

The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:

> Controls whether to clean checkpoint files if the reference is out of scope.

When Spark shuts down, everything goes out of scope, so I'd expect all 
checkpointed RDDs to be cleaned up.


> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint] 
>   
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2020-09-25 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-33000:


 Summary: cleanCheckpoints config does not clean all checkpointed 
RDDs on shutdown
 Key: SPARK-33000
 URL: https://issues.apache.org/jira/browse/SPARK-33000
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.6
Reporter: Nicholas Chammas


Maybe it's just that the documentation needs to be updated, but I found this 
surprising:
{code:java}
$ pyspark
...
>>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
>>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
>>> a = spark.range(10)
>>> a.checkpoint()
DataFrame[id: bigint]   
>>> exit(){code}
The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
Spark to clean it up on shutdown.

The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} says:

> Controls whether to clean checkpoint files if the reference is out of scope.

When Spark shuts down, everything goes out of scope, so I'd expect all 
checkpointed RDDs to be cleaned up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py

2020-08-30 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187364#comment-17187364
 ] 

Nicholas Chammas commented on SPARK-32084:
--

Can you share a couple of examples of what you are referring to in this ticket?

> Replace dictionary-based function definitions to proper functions in 
> functions.py
> -
>
> Key: SPARK-32084
> URL: https://issues.apache.org/jira/browse/SPARK-32084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently some functions in {{functions.py}} are defined by a dictionary. It 
> programmatically defines the functions to the module; however, it means some 
> IDEs such as PyCharm don't detect them.
> Also, it makes it hard to add proper examples into the docstrings.
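
For context, a simplified sketch of the pattern being described; this is an 
assumed illustration of dict-driven definitions, not the actual code in 
functions.py:
{code:python}
# Simplified illustration: functions are generated from a dict and injected
# into the module namespace, which IDE static analysis cannot follow.
_functions = {
    "sqrt": "Computes the square root of the given column.",
    "abs": "Computes the absolute value of the given column.",
}

def _create_function(name, doc):
    def _(col):
        raise NotImplementedError  # real code would delegate to the JVM function
    _.__name__ = name
    _.__doc__ = doc
    return _

for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)
{code}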



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31167) Refactor how we track Python test/build dependencies

2020-08-25 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31167:
-
Description: Ideally, we should have a single place to track Python 
development dependencies and reuse it in all the relevant places: developer 
docs, Dockerfile, and GitHub CI. Where appropriate, we should pin dependencies 
to ensure a reproducible Python environment.  (was: Ideally, we should have a 
single place to track Python development dependencies and reuse it in all the 
relevant places: developer docs, Dockerfile, )

> Refactor how we track Python test/build dependencies
> 
>
> Key: SPARK-31167
> URL: https://issues.apache.org/jira/browse/SPARK-31167
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Ideally, we should have a single place to track Python development 
> dependencies and reuse it in all the relevant places: developer docs, 
> Dockerfile, and GitHub CI. Where appropriate, we should pin dependencies to 
> ensure a reproducible Python environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31167) Refactor how we track Python test/build dependencies

2020-08-25 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31167:
-
Description: Ideally, we should have a single place to track Python 
development dependencies and reuse it in all the relevant places: developer 
docs, Dockerfile, 

> Refactor how we track Python test/build dependencies
> 
>
> Key: SPARK-31167
> URL: https://issues.apache.org/jira/browse/SPARK-31167
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Ideally, we should have a single place to track Python development 
> dependencies and reuse it in all the relevant places: developer docs, 
> Dockerfile, 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries

2020-08-21 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-32686:


 Summary: Un-deprecate inferring DataFrame schema from list of 
dictionaries
 Key: SPARK-32686
 URL: https://issues.apache.org/jira/browse/SPARK-32686
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


Inferring the schema of a DataFrame from a list of dictionaries feels natural 
for PySpark users, and also mirrors [basic functionality in 
Pandas|https://stackoverflow.com/a/20638258/877069].

This is currently possible in PySpark but comes with a deprecation warning. We 
should un-deprecate this behavior if there are no deeper reasons to discourage 
users from this API, beyond wanting to push them to use {{Row}}.
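
For example (a quick sketch; the exact warning text may differ by version, and {{spark}} here is the usual shell/session object):

{code:python}
import pandas as pd

# Pandas: building a frame from a list of dicts is a first-class pattern.
pdf = pd.DataFrame([{"name": "Alice", "age": 1}, {"name": "Bob", "age": 2}])

# PySpark: the equivalent works, but currently prints a deprecation warning
# telling users to construct Row objects instead.
df = spark.createDataFrame([{"name": "Alice", "age": 1}, {"name": "Bob", "age": 2}])
df.printSchema()
{code}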



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32686) Un-deprecate inferring DataFrame schema from list of dictionaries

2020-08-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182040#comment-17182040
 ] 

Nicholas Chammas commented on SPARK-32686:
--

Not sure if I have the "Affects Version" field set appropriately.

> Un-deprecate inferring DataFrame schema from list of dictionaries
> -
>
> Key: SPARK-32686
> URL: https://issues.apache.org/jira/browse/SPARK-32686
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Inferring the schema of a DataFrame from a list of dictionaries feels natural 
> for PySpark users, and also mirrors [basic functionality in 
> Pandas|https://stackoverflow.com/a/20638258/877069].
> This is currently possible in PySpark but comes with a deprecation warning. 
> We should un-deprecate this behavior if there are no deeper reasons to 
> discourage users from this API, beyond wanting to push them to use {{Row}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated

2020-04-22 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089917#comment-17089917
 ] 

Nicholas Chammas commented on SPARK-27623:
--

Is this perhaps just a documentation issue? That is, the documentation 
[here|http://spark.apache.org/docs/2.4.5/sql-data-sources-avro.html#deploying] 
should use 2.11 instead of 2.12, or at least clarify how to identify which 
version to use.

The [downloads page|http://spark.apache.org/downloads.html] does say:
{quote}Note that, Spark is pre-built with Scala 2.11 except version 2.4.2, 
which is pre-built with Scala 2.12.
{quote}
Which I think explains the behavior people are reporting here.
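
In other words (just a sketch to illustrate; please double-check the exact package coordinates, since they are my assumption based on the docs): with a 2.4.x download that is pre-built with Scala 2.11, the spark-avro artifact has to use the matching {{_2.11}} suffix rather than {{_2.12}}:

{code:python}
from pyspark.sql import SparkSession

# Assuming a Spark 2.4.x build on Scala 2.11, the spark-avro package must use
# the matching _2.11 artifact suffix; "/path/to/data.avro" is a placeholder.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.5")
    .getOrCreate()
)

df = spark.read.format("avro").load("/path/to/data.avro")
{code}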

> Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> ---
>
> Key: SPARK-27623
> URL: https://issues.apache.org/jira/browse/SPARK-27623
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.2
>Reporter: Alexandru Barbulescu
>Priority: Major
>
> After updating to Spark 2.4.2, when using the 
> {code:java}
> spark.read.format().options().load()
> {code}
>  
> chain of methods, regardless of what parameter is passed to "format", we get 
> the following error related to Avro:
>  
> {code:java}
> - .options(**load_options)
> - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 
> 172, in load
> - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
> 1257, in __call__
> - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
> deco
> - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load.
> - : java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
> - at java.util.ServiceLoader.fail(ServiceLoader.java:232)
> - at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
> - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
> - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
> - at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
> - at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
> - at scala.collection.Iterator.foreach(Iterator.scala:941)
> - at scala.collection.Iterator.foreach$(Iterator.scala:941)
> - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> - at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> - at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> - at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> - at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
> - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
> - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
> - at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
> - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
> - at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
> - at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
> - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
> - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> - at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> - at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> - at java.lang.reflect.Method.invoke(Method.java:498)
> - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> - at py4j.Gateway.invoke(Gateway.java:282)
> - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> - at py4j.commands.CallCommand.execute(CallCommand.java:79)
> - at py4j.GatewayConnection.run(GatewayConnection.java:238)
> - at java.lang.Thread.run(Thread.java:748)
> - Caused by: java.lang.NoClassDefFoundError: 
> org/apache/spark/sql/execution/datasources/FileFormat$class
> - at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
> - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> - at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> - at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> - at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> - at java.lang.Class.newInstance(Class.java:442)
> - at 

[jira] [Commented] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir

2020-04-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088997#comment-17088997
 ] 

Nicholas Chammas commented on SPARK-31170:
--

Isn't this also an issue in Spark 2.4.5?
{code:java}
$ spark-sql
...
20/04/21 15:12:26 INFO metastore: Mestastore configuration 
hive.metastore.warehouse.dir changed from /user/hive/warehouse to 
file:/Users/myusername/spark-warehouse
...
spark-sql> create table test(a int);
...
20/04/21 15:12:48 WARN HiveMetaStore: Location: file:/user/hive/warehouse/test 
specified for non-external table:test
20/04/21 15:12:48 INFO FileUtils: Creating directory if it doesn't exist: 
file:/user/hive/warehouse/test
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:file:/user/hive/warehouse/test is not a directory or 
unable to create one);{code}
If I understood the problem correctly, then the fix should perhaps also be 
backported to 2.4.x.

> Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
> 
>
> Key: SPARK-31170
> URL: https://issues.apache.org/jira/browse/SPARK-31170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> In Spark CLI, we create a hive CliSessionState and it does not load the 
> hive-site.xml. So the configurations in hive-site.xml will not take effect as 
> they do in other spark-hive integration apps.
> Also, the warehouse directory is not correctly picked up. If the `default` 
> database does not exist, the CliSessionState will create one the first time it 
> talks to the metastore. The `Location` of the default DB will be neither the 
> value of spark.sql.warehouse.dir nor the user-specified value of 
> hive.metastore.warehouse.dir, but the default value of 
> hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074217#comment-17074217
 ] 

Nicholas Chammas commented on SPARK-31330:
--

Hmm, I didn't see anything from you on the mailing list. But thanks for these 
references! This is very helpful.

Looks like you had Infra enable autolabeler for the Avro project over in 
INFRA-17367. I will ask Infra to do the same for Spark and cc [~hyukjin.kwon] 
for committer approval (which I guess Infra may ask for).

> Automatically label PRs based on the paths they touch
> -
>
> Key: SPARK-31330
> URL: https://issues.apache.org/jira/browse/SPARK-31330
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We can potentially leverage the added labels to drive testing, review, or 
> other project tooling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074124#comment-17074124
 ] 

Nicholas Chammas commented on SPARK-31330:
--

Unfortunately, it seems I jumped the gun on sending that dev email about the 
GitHub PR labeler action.

It has a fundamental limitation that currently makes it [useless for 
us|https://github.com/actions/labeler/tree/d2c408e7ed8498dfdf675c5f8d133ab37b6f8520#pull-request-labeler]:
{quote}Note that only pull requests being opened from the same repository can 
be labeled. This action will not currently work for pull requests from forks – 
like is common in open source projects – because the token for forked pull 
request workflows does not have write permissions.
{quote}
Additional detail: 
[https://github.com/actions/labeler/issues/12#issuecomment-525762657]

I'll keep my eye on that Action in case they somehow lift or work around the 
limitation on forked repositories.

Of course, we can always implement this functionality ourselves, but the 
attraction of the GitHub Action was that we could reuse an existing, tested, 
and widely adopted implementation.
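
(For the record, the "implement it ourselves" fallback would be conceptually simple – something like the toy sketch below, with made-up label rules, run from a workflow that has write access to the base repository:)

{code:python}
import fnmatch

# Hypothetical path-to-label rules, roughly what a labeler config would contain.
LABEL_RULES = {
    "SQL": ["sql/*"],
    "PYTHON": ["python/*"],
    "BUILD": ["pom.xml", "dev/*"],
}


def labels_for(changed_files):
    """Return the labels whose patterns match any of the changed file paths."""
    labels = set()
    for label, patterns in LABEL_RULES.items():
        if any(fnmatch.fnmatch(path, pattern)
               for pattern in patterns
               for path in changed_files):
            labels.add(label)
    return labels


print(labels_for(["python/pyspark/sql/functions.py", "pom.xml"]))
# -> {'PYTHON', 'BUILD'} (set order may vary)
{code}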

> Automatically label PRs based on the paths they touch
> -
>
> Key: SPARK-31330
> URL: https://issues.apache.org/jira/browse/SPARK-31330
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We can potentially leverage the added labels to drive testing, review, or 
> other project tooling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-02 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31330:


 Summary: Automatically label PRs based on the paths they touch
 Key: SPARK-31330
 URL: https://issues.apache.org/jira/browse/SPARK-31330
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


We can potentially leverage the added labels to drive testing, review, or other 
project tooling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31167) Refactor how we track Python test/build dependencies

2020-03-16 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31167:
-
Summary: Refactor how we track Python test/build dependencies  (was: 
Refactor how we track Python test dependencies)

> Refactor how we track Python test/build dependencies
> 
>
> Key: SPARK-31167
> URL: https://issues.apache.org/jira/browse/SPARK-31167
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31167) Refactor how we track Python test dependencies

2020-03-16 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31167:
-
Summary: Refactor how we track Python test dependencies  (was: Specify 
missing test dependencies)

> Refactor how we track Python test dependencies
> --
>
> Key: SPARK-31167
> URL: https://issues.apache.org/jira/browse/SPARK-31167
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31167) Specify missing test dependencies

2020-03-16 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31167:


 Summary: Specify missing test dependencies
 Key: SPARK-31167
 URL: https://issues.apache.org/jira/browse/SPARK-31167
 Project: Spark
  Issue Type: Improvement
  Components: Build, Tests
Affects Versions: 3.1.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31155) Remove pydocstyle tests

2020-03-16 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31155:
-
Summary: Remove pydocstyle tests  (was: Enable pydocstyle tests)

> Remove pydocstyle tests
> ---
>
> Key: SPARK-31155
> URL: https://issues.apache.org/jira/browse/SPARK-31155
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> pydocstyle tests have been running neither on Jenkins nor on Github.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31155) Remove pydocstyle tests

2020-03-16 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31155:
-
Description: 
pydocstyle tests have been running neither on Jenkins nor on Github.

We also seem to be in a [bad 
place|https://github.com/apache/spark/pull/27912#issuecomment-599167117] to 
re-enable them.

  was:pydocstyle tests have been running neither on Jenkins nor on Github.


> Remove pydocstyle tests
> ---
>
> Key: SPARK-31155
> URL: https://issues.apache.org/jira/browse/SPARK-31155
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> pydocstyle tests have been running neither on Jenkins nor on Github.
> We also seem to be in a [bad 
> place|https://github.com/apache/spark/pull/27912#issuecomment-599167117] to 
> re-enable them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29280) DataFrameReader should support a compression option

2020-03-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-29280:
-
Affects Version/s: (was: 3.0.0)
   3.1.0

> DataFrameReader should support a compression option
> ---
>
> Key: SPARK-29280
> URL: https://issues.apache.org/jira/browse/SPARK-29280
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [DataFrameWriter|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter]
>  supports a {{compression}} option, but 
> [DataFrameReader|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
>  doesn't. The lack of a {{compression}} option in the reader causes some 
> friction in the following cases:
>  # You want to read some data compressed with a codec that Spark does not 
> [load by 
> default|http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization].
>  # You want to read some data with a codec that overrides one of the built-in 
> codecs that Spark supports.
>  # You want to explicitly instruct Spark on what codec to use on read when it 
> will not be able to correctly auto-detect it (e.g. because the file extension 
> is [missing,|https://stackoverflow.com/q/52011697/877069] 
> [non-standard|https://stackoverflow.com/q/44372995/877069], or 
> [incorrect|https://stackoverflow.com/q/49110384/877069]).
> Case #2 came up in SPARK-29102. There is a very handy library called 
> [SplittableGzip|https://github.com/nielsbasjes/splittablegzip] that lets you 
> load a single gzipped file using multiple concurrent tasks. (You can see the 
> details of how it works and why it's useful in the project README and in 
> SPARK-29102.)
> To use this codec, I had to set {{io.compression.codecs}}. I guess this is a 
> Hadoop filesystem API setting, since it [doesn't appear to be documented by 
> Spark|http://spark.apache.org/docs/latest/configuration.html]. Confusingly, 
> there is also a setting called {{spark.io.compression.codec}}, which seems to 
> be for a different purpose.
> It would be much clearer for the user and more consistent with the writer 
> interface if the reader let you directly specify the codec.
> For example, I think all of the following should be possible:
> {code:python}
> spark.read.option('compression', 'lz4').csv(...)
> spark.read.csv(..., 
> compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec')
> spark.read.json(..., compression='none')
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31155) Enable pydocstyle tests

2020-03-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31155:


 Summary: Enable pydocstyle tests
 Key: SPARK-31155
 URL: https://issues.apache.org/jira/browse/SPARK-31155
 Project: Spark
  Issue Type: Bug
  Components: Build, Documentation
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


pydocstyle tests have been running neither on Jenkins nor on Github.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31153) Cleanup several failures in lint-python

2020-03-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31153:


 Summary: Cleanup several failures in lint-python
 Key: SPARK-31153
 URL: https://issues.apache.org/jira/browse/SPARK-31153
 Project: Spark
  Issue Type: Bug
  Components: Build, PySpark
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


I don't understand how this script runs fine on the build server. Perhaps we've 
just been getting lucky?

Will detail the issues on the PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31075) Add documentation for ALTER TABLE ... ADD PARTITION

2020-03-09 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-31075.
--
Resolution: Duplicate

> Add documentation for ALTER TABLE ... ADD PARTITION
> ---
>
> Key: SPARK-31075
> URL: https://issues.apache.org/jira/browse/SPARK-31075
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Our docs for {{ALTER TABLE}} 
> [currently|https://github.com/apache/spark/blob/cba17e07e9f15673f274de1728f6137d600026e1/docs/sql-ref-syntax-ddl-alter-table.md]
>  make no mention of {{ADD PARTITION}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-08 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054676#comment-17054676
 ] 

Nicholas Chammas commented on SPARK-31043:
--

It's working for me now (per my comment), but when I was seeing the issue, 
simply starting a PySpark shell was enough to throw that error. Do you still 
need a reproduction?

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Fix For: 3.0.0
>
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-08 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054674#comment-17054674
 ] 

Nicholas Chammas commented on SPARK-31065:
--

Thanks for looking into it.

I have a silly question: Why isn't {{schema_of_json()}} simply syntactic sugar 
for {{spark.read.json()}}?

For example:
{code:python}
def _schema_of_json(string):
    df = spark.read.json(spark.sparkContext.parallelize([string]))
    return df.schema{code}
Perhaps there are practical reasons not to do this, but conceptually speaking 
this kind of equivalence should hold. Yet this bug report demonstrates that 
they are not equivalent.
{code:python}
from pyspark.sql.functions import from_json, schema_of_json
json = '{"a": ""}'

df = spark.createDataFrame([(json,)], schema=['json'])
df.show()

# chokes with org.apache.spark.sql.catalyst.parser.ParseException
json_schema = schema_of_json(json)
df.select(from_json('json', json_schema))

# works fine
json_schema = _schema_of_json(json)
df.select(from_json('json', json_schema)) {code}

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 

[jira] [Created] (SPARK-31075) Add documentation for ALTER TABLE ... ADD PARTITION

2020-03-06 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31075:


 Summary: Add documentation for ALTER TABLE ... ADD PARTITION
 Key: SPARK-31075
 URL: https://issues.apache.org/jira/browse/SPARK-31075
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


Our docs for {{ALTER TABLE}} 
[currently|https://github.com/apache/spark/blob/cba17e07e9f15673f274de1728f6137d600026e1/docs/sql-ref-syntax-ddl-alter-table.md]
 make no mention of {{ADD PARTITION}}.
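
For example, the docs should cover something along these lines (syntax sketched from memory against a hypothetical {{logs}} table, so the exact grammar should be confirmed against the parser):

{code:python}
# Adding one or more partitions to a partitioned table via Spark SQL:
spark.sql("""
    ALTER TABLE logs ADD IF NOT EXISTS
    PARTITION (ds='2020-03-06', region='us')
    PARTITION (ds='2020-03-06', region='eu')
""")
{code}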



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Show Maven errors from within make-distribution.sh

2020-03-06 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Description: 
This works:
{code:java}
./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud {code}
 
 But this doesn't:
{code:java}
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip{code}
 

The latter invocation yields the following, confusing output:
{code:java}
 + VERSION=' -X,--debug Produce execution debug output'{code}
 That's because Maven is accepting {{--pip}} as an option and failing, but the 
user doesn't get to see the error from Maven.

  was:
This works:
{code:java}
./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud {code}
 
 But this doesn't:
{code:java}
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip{code}
 

The latter invocation yields the following, confusing output:
{code:java}
 + VERSION=' -X,--debug Produce execution debug output'{code}
 


> Show Maven errors from within make-distribution.sh
> --
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
> {code:java}
> ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud {code}
>  
>  But this doesn't:
> {code:java}
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip{code}
>  
> The latter invocation yields the following, confusing output:
> {code:java}
>  + VERSION=' -X,--debug Produce execution debug output'{code}
>  That's because Maven is accepting {{--pip}} as an option and failing, but 
> the user doesn't get to see the error from Maven.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Show Maven errors from within make-distribution.sh

2020-03-06 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Summary: Show Maven errors from within make-distribution.sh  (was: Make 
arguments to make-distribution.sh position-independent)

> Show Maven errors from within make-distribution.sh
> --
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
> {code:java}
> ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud {code}
>  
>  But this doesn't:
> {code:java}
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip{code}
>  
> The latter invocation yields the following, confusing output:
> {code:java}
>  + VERSION=' -X,--debug Produce execution debug output'{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053080#comment-17053080
 ] 

Nicholas Chammas commented on SPARK-31043:
--

FWIW I was seeing the same {{java.lang.NoClassDefFoundError: 
org/w3c/dom/ElementTraversal}} issue on {{branch-3.0}} and pulling the latest 
changes fixed it for me too.

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Fix For: 3.0.0
>
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053079#comment-17053079
 ] 

Nicholas Chammas commented on SPARK-31065:
--

Confirmed this issue is also present on {{branch-3.0}} as of commit 
{{9b48f3358d3efb523715a5f258e5ed83e28692f6}}.

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
> struct
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
>   at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777)

[jira] [Updated] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-05 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31065:
-
Affects Version/s: 3.0.0

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
> struct
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
>   at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:527)
>   at 

[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-05 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053052#comment-17053052
 ] 

Nicholas Chammas commented on SPARK-31065:
--

cc [~hyukjin.kwon]

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
> struct
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
>   at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:527)
> 

[jira] [Created] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-05 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31065:


 Summary: Empty string values cause schema_of_json() to return a 
schema not usable by from_json()
 Key: SPARK-31065
 URL: https://issues.apache.org/jira/browse/SPARK-31065
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5
Reporter: Nicholas Chammas


Here's a reproduction:
  
{code:python}
from pyspark.sql.functions import from_json, schema_of_json
json = '{"a": ""}'

df = spark.createDataFrame([(json,)], schema=['json'])
df.show()

# chokes with org.apache.spark.sql.catalyst.parser.ParseException
json_schema = schema_of_json(json)
df.select(from_json('json', json_schema))

# works fine
json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
df.select(from_json('json', json_schema))
{code}
The output:
{code:java}
>>> from pyspark.sql.functions import from_json, schema_of_json
>>> json = '{"a": ""}'
>>> 
>>> df = spark.createDataFrame([(json,)], schema=['json'])
>>> df.show()
+---------+
|     json|
+---------+
|{"a": ""}|
+---------+

>>> 
>>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
>>> json_schema = schema_of_json(json)
>>> df.select(from_json('json', json_schema))
Traceback (most recent call last):
  File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File 
".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.functions.from_json.
: org.apache.spark.sql.catalyst.parser.ParseException: 
extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 'POSITION', 
'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 
'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 'SERDEPROPERTIES', 
'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 
'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 
'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', 
'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 
'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 
'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 
'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 
'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 
'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 
'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 
'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 
'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', IDENTIFIER, 
BACKQUOTED_IDENTIFIER}(line 1, pos 6)

== SQL ==
struct
--^^^

at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:64)
at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:123)
at 
org.apache.spark.sql.catalyst.expressions.JsonExprUtils$.evalSchemaExpr(jsonExpressions.scala:777)
at 
org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:527)
at org.apache.spark.sql.functions$.from_json(functions.scala:3606)
at org.apache.spark.sql.functions.from_json(functions.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at 

[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent

2020-03-04 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Description: 
This works:
{code:java}
./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud {code}
 
 But this doesn't:
{code:java}
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip{code}
 

The latter invocation yields the following, confusing output:
{code:java}
 + VERSION=' -X,--debug Produce execution debug output'{code}
 

  was:
This works:

```
 ./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud

```
  
 But this doesn't:

```
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip

```


> Make arguments to make-distribution.sh position-independent
> ---
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
> {code:java}
> ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud {code}
>  
>  But this doesn't:
> {code:java}
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip{code}
>  
> The latter invocation yields the following confusing output:
> {code:java}
>  + VERSION=' -X,--debug Produce execution debug output'{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent

2020-03-04 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Description: 
This works:

```
 ./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud

```
  
 But this doesn't:

```
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip

```

  was:
This works:

 

```
 ./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud
 ```
  
 But this doesn't:
  
 ```
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip

```


> Make arguments to make-distribution.sh position-independent
> ---
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
> ```
>  ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud
> ```
>   
>  But this doesn't:
> ```
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent

2020-03-04 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Description: 
This works:

 

```
 ./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud
 ```
  
 But this doesn't:
  
 ```
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip

```

  was:
This works:

 

```
./dev/make-distribution.sh \
--pip \
-Phadoop-2.7 -Phive -Phadoop-cloud
```
 
But this doesn't:
 
```
./dev/make-distribution.sh \
-Phadoop-2.7 -Phive -Phadoop-cloud \
--pip```


> Make arguments to make-distribution.sh position-independent
> ---
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
>  
> ```
>  ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud
>  ```
>   
>  But this doesn't:
>   
>  ```
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent

2020-03-04 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Summary: Make arguments to make-distribution.sh position-independent  (was: 
Make argument to make-distribution position-independent)

> Make arguments to make-distribution.sh position-independent
> ---
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
>  
> ```
> ./dev/make-distribution.sh \
> --pip \
> -Phadoop-2.7 -Phive -Phadoop-cloud
> ```
>  
> But this doesn't:
>  
> ```
> ./dev/make-distribution.sh \
> -Phadoop-2.7 -Phive -Phadoop-cloud \
> --pip```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31041) Make argument to make-distribution position-independent

2020-03-04 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31041:


 Summary: Make argument to make-distribution position-independent
 Key: SPARK-31041
 URL: https://issues.apache.org/jira/browse/SPARK-31041
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


This works:

 

```
./dev/make-distribution.sh \
--pip \
-Phadoop-2.7 -Phive -Phadoop-cloud
```
 
But this doesn't:
 
```
./dev/make-distribution.sh \
-Phadoop-2.7 -Phive -Phadoop-cloud \
--pip```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31001) Add ability to create a partitioned table via catalog.createTable()

2020-03-01 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31001:


 Summary: Add ability to create a partitioned table via 
catalog.createTable()
 Key: SPARK-31001
 URL: https://issues.apache.org/jira/browse/SPARK-31001
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


There doesn't appear to be a way to create a partitioned table using the 
Catalog interface.
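
Until that exists, partitioning has to go through SQL DDL or the DataFrameWriter. A short PySpark sketch of both routes follows; the table names and columns ({{events}}, {{id}}, {{ds}}) are hypothetical placeholders rather than anything from this ticket.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Route 1: SQL DDL supports partition columns directly.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (id BIGINT, ds STRING)
    USING parquet
    PARTITIONED BY (ds)
""")

# Route 2: DataFrameWriter can create a partitioned table from data.
df = spark.createDataFrame([(1, "2020-03-01"), (2, "2020-03-02")], ["id", "ds"])
df.write.mode("overwrite").partitionBy("ds").saveAsTable("events_from_df")

# spark.catalog.createTable(name, source=..., schema=...), by contrast,
# exposes no partitioning argument, which is what this ticket requests.
{code}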



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31000) Add ability to set table description in the catalog

2020-03-01 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31000:


 Summary: Add ability to set table description in the catalog
 Key: SPARK-31000
 URL: https://issues.apache.org/jira/browse/SPARK-31000
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


It seems that the catalog supports a {{description}} attribute on tables.

https://github.com/apache/spark/blob/86cc907448f0102ad0c185e87fcc897d0a32707f/sql/core/src/main/scala/org/apache/spark/sql/catalog/interface.scala#L68

However, the {{createTable()}} interface doesn't provide any way to set that 
attribute.
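
As a stopgap, the field can be populated through SQL and read back through the catalog. A small PySpark sketch is below; the table name and comment text ({{people}}) are illustrative only.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The COMMENT clause in DDL populates the same description field that the
# Catalog API exposes but (per this ticket) cannot set via createTable().
spark.sql("""
    CREATE TABLE IF NOT EXISTS people (id BIGINT, name STRING)
    USING parquet
    COMMENT 'Registered users, one row per person'
""")

# Table objects returned by listTables() carry the description.
for table in spark.catalog.listTables():
    if table.name == "people":
        print(table.name, "->", table.description)
{code}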



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30838) Add missing pages to documentation index

2020-02-18 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-30838.
--
Resolution: Won't Fix

> Add missing pages to documentation index
> 
>
> Key: SPARK-30838
> URL: https://issues.apache.org/jira/browse/SPARK-30838
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> There are a few pages tracked in `docs/` that are not linked to from the top 
> navigation or from the index page. Seems unintentional.
> To make these pages easier to discover, we should at least list them in the 
> index and link to them from other pages as appropriate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30838) Add missing pages to documentation index

2020-02-18 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039351#comment-17039351
 ] 

Nicholas Chammas commented on SPARK-30838:
--

Actually, it looks like the pages I wanted to add (hadoop-provided, cloud 
infra, web UI) already have references from one place or another. They're 
not as discoverable as I'd like, but I don't have a good idea for how to 
improve things right now, so I'm going to close this.

> Add missing pages to documentation index
> 
>
> Key: SPARK-30838
> URL: https://issues.apache.org/jira/browse/SPARK-30838
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> There are a few pages tracked in `docs/` that are not linked to from the top 
> navigation or from the index page. Seems unintentional.
> To make these pages easier to discover, we should at least list them in the 
> index and link to them from other pages as appropriate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30838) Add missing pages to documentation index

2020-02-18 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30838:
-
Summary: Add missing pages to documentation index  (was: Add missing pages 
to documentation top navigation menu)

> Add missing pages to documentation index
> 
>
> Key: SPARK-30838
> URL: https://issues.apache.org/jira/browse/SPARK-30838
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> There are a few pages tracked in `docs/` that are not linked to from the top 
> navigation or from the index page. Seems unintentional.
> To make these pages easier to discover, we should at least list them in the 
> index and link to them from other pages as appropriate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30838) Add missing pages to documentation top navigation menu

2020-02-18 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30838:
-
Description: 
There are a few pages tracked in `docs/` that are not linked to from the top 
navigation or from the index page. Seems unintentional.

To make these pages easier to discover, we should at least list them in the 
index and link to them from other pages as appropriate.

  was:
There are a few pages tracked in `docs/` that are not linked to from the top 
navigation. Seems unintentional.

To make these pages easier to discover, we should list them up top and link 
to them from other documentation pages as appropriate.


> Add missing pages to documentation top navigation menu
> --
>
> Key: SPARK-30838
> URL: https://issues.apache.org/jira/browse/SPARK-30838
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> There are a few pages tracked in `docs/` that are not linked to from the top 
> navigation or from the index page. Seems unintentional.
> To make these pages easier to discover, we should at least list them in the 
> index and link to them from other pages as appropriate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30838) Add missing pages to documentation top navigation menu

2020-02-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30838:
-
Description: 
There are a few pages tracked in `docs/` that are not linked to from the top 
navigation. Seems unintentional.

 

To make these pages easier to discover, we should list them up top and link 
to them from other documentation pages as appropriate.

  was:There are a few pages tracked in `docs/` that are not linked to from the 
top navigation. Seems unintentional.


> Add missing pages to documentation top navigation menu
> --
>
> Key: SPARK-30838
> URL: https://issues.apache.org/jira/browse/SPARK-30838
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> There are a few pages tracked in `docs/` that are not linked to from the top 
> navigation. Seems unintentional.
>  
> To make these pages easier to discover, we should list them up top and link 
> to them from other documentation pages as appropriate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30838) Add missing pages to documentation top navigation menu

2020-02-14 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30838:
-
Description: 
There are a few pages tracked in `docs/` that are not linked to from the top 
navigation. Seems unintentional.

To make these pages easier to discover, we should list them up top and link 
to them from other documentation pages as appropriate.

  was:
There are a few pages tracked in `docs/` that are not linked to from the top 
navigation. Seems unintentional.

 

To make these pages easier to discover, we should list them up top and link 
to them from other documentation pages as appropriate.


> Add missing pages to documentation top navigation menu
> --
>
> Key: SPARK-30838
> URL: https://issues.apache.org/jira/browse/SPARK-30838
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> There are a few pages tracked in `docs/` that are not linked to from the top 
> navigation. Seems unintentional.
> To make these pages easier to discover, we should list them up top and link 
> to them from other documentation pages as appropriate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30838) Add missing pages to documentation top navigation menu

2020-02-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30838:


 Summary: Add missing pages to documentation top navigation menu
 Key: SPARK-30838
 URL: https://issues.apache.org/jira/browse/SPARK-30838
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Nicholas Chammas


There are a few pages tracked in `docs/` that are not linked to from the top 
navigation. Seems unintentional.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


