[jira] [Commented] (SPARK-30670) Pipes for PySpark
[ https://issues.apache.org/jira/browse/SPARK-30670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027244#comment-17027244 ] Hyukjin Kwon commented on SPARK-30670: -- Yes, it's intentional in order to match the Scala side. It is easy to work around - you can just pass the arguments directly to the function. > Pipes for PySpark > - > > Key: SPARK-30670 > URL: https://issues.apache.org/jira/browse/SPARK-30670 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Vincent >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > I would propose to add a `pipe` method to a Spark DataFrame. It allows for a > functional programming pattern, inspired by the tidyverse, that is > currently missing. The pandas community also recently adopted this pattern, > documented [here|https://tomaugspurger.github.io/method-chaining.html]. > This is the idea. Suppose you had this: > {code:java} > # file that has [user, date, timestamp, eventtype] > ddf = spark.read.parquet("") > w_user = Window().partitionBy("user") > w_user_date = Window().partitionBy("user", "date") > w_user_time = Window().partitionBy("user").orderBy("timestamp") > thres_sesstime = 60 * 15 > min_n_rows = 10 > min_n_sessions = 5 > clean_ddf = (ddf > .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user_time)) > .withColumn("new_session", (sf.col("delta") > thres_sesstime).cast("integer")) > .withColumn("session", sf.sum(sf.col("new_session")).over(w_user_time)) > .drop("new_session") > .drop("delta") > .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user)) > .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user)) > .filter(sf.col("nrow_user") > min_n_rows) > .filter(sf.col("nrow_user_date") > min_n_sessions) > .drop("nrow_user") > .drop("nrow_user_date")) > {code} > The code works and it is somewhat clear.
We add a session to the dataframe > and then use it to remove outliers. The issue is that this chain of > commands can get quite long, so it might be better to turn it into > functions. > {code:java} > def add_session(dataf, session_threshold=60*15): > w_user = Window().partitionBy("user").orderBy("timestamp") > > return (dataf > .withColumn("delta", sf.col("timestamp") - > sf.lag("timestamp").over(w_user)) > .withColumn("new_session", (sf.col("delta") > session_threshold).cast("integer")) > .withColumn("session", sf.sum(sf.col("new_session")).over(w_user)) > .drop("new_session") > .drop("delta")) > def remove_outliers(dataf, min_n_rows=10, min_n_sessions=5): > w_user = Window().partitionBy("user") > > return (dataf > .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user)) > .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user)) > .filter(sf.col("nrow_user") > min_n_rows) > .filter(sf.col("nrow_user_date") > min_n_sessions) > .drop("nrow_user") > .drop("nrow_user_date")) > {code} > The issue does not lie in these functions. These functions are great! You can > unit test them, and they give you nice verbs that serve as an > abstraction. The issue is in how you now need to apply them. > {code:java} > remove_outliers(add_session(ddf, session_threshold=1000), min_n_rows=11) > {code} > It'd be much nicer to allow for this: > {code:java} > (ddf > .pipe(add_session, session_threshold=900) > .pipe(remove_outliers, min_n_rows=11)) > {code} > The cool thing about this is that you can very easily allow for method > chaining, and you also get a great way to separate high-level code from > low-level code. You still allow customization at a high level by exposing keyword > arguments, but you can easily find the lower-level code when debugging because > you've contained the details in their functions. > For code maintenance, I've relied on this pattern a lot personally.
But > so far, I've always monkey-patched Spark to be able to do this. > {code:java} > from pyspark.sql import DataFrame > def pipe(self, func, *args, **kwargs): > return func(self, *args, **kwargs) > DataFrame.pipe = pipe > {code} > Could I perhaps add these few lines of code to the codebase? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
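The monkey-patched `pipe` above is just function application with the dataframe threaded through as the first argument. A minimal, Spark-free sketch of the same pattern (the `Frame` class and the two verbs here are hypothetical, purely for illustration):

```python
class Frame:
    """Hypothetical stand-in for a Spark DataFrame, to illustrate the pattern."""
    def __init__(self, rows):
        self.rows = rows

    def pipe(self, func, *args, **kwargs):
        # Same shape as the proposed DataFrame.pipe: apply func to self,
        # forwarding any extra positional and keyword arguments.
        return func(self, *args, **kwargs)

def drop_short(frame, min_len=3):
    return Frame([r for r in frame.rows if len(r) >= min_len])

def upper(frame):
    return Frame([r.upper() for r in frame.rows])

result = (Frame(["hi", "spark", "jira"])
          .pipe(drop_short, min_len=4)
          .pipe(upper))
print(result.rows)  # ['SPARK', 'JIRA']
```

Each verb stays independently unit-testable, while the chain reads top to bottom with the tuning parameters exposed at the call site.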
[jira] [Resolved] (SPARK-30671) SparkSession emptyDataFrame should not create an RDD
[ https://issues.apache.org/jira/browse/SPARK-30671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30671. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27400 [https://github.com/apache/spark/pull/27400] > SparkSession emptyDataFrame should not create an RDD > > > Key: SPARK-30671 > URL: https://issues.apache.org/jira/browse/SPARK-30671 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.0.0 > > > SparkSession.emptyDataFrame uses an empty RDD in the background. This is a > bit of a pity because the optimizer can't recognize this as being empty and > it won't apply things like {{PropagateEmptyRelation}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30670) Pipes for PySpark
[ https://issues.apache.org/jira/browse/SPARK-30670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027242#comment-17027242 ] Vincent commented on SPARK-30670: - I just had a look, but transform does not allow for `*args` and `**kwargs`. Is there a reason for this? To me this feels like it is not feature-complete. > Pipes for PySpark > - > > Key: SPARK-30670 > URL: https://issues.apache.org/jira/browse/SPARK-30670 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Vincent >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26449) Missing Dataframe.transform API in Python API
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027241#comment-17027241 ] Vincent commented on SPARK-26449: - Is there a reason why transform does not accept `*args` and `**kwargs`? > Missing Dataframe.transform API in Python API > - > > Key: SPARK-26449 > URL: https://issues.apache.org/jira/browse/SPARK-26449 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Hanan Shteingart >Assignee: Erik Christiansen >Priority: Minor > Fix For: 3.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > I would like to chain custom transformations as suggested in this [blog > post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55] > This allows writing something like the following: > {code:java} > def with_greeting(df): > return df.withColumn("greeting", lit("hi")) > def with_something(df, something): > return df.withColumn("something", lit(something)) > data = [("jose", 1), ("li", 2), ("liz", 3)] > source_df = spark.createDataFrame(data, ["name", "age"]) > actual_df = (source_df > .transform(with_greeting) > .transform(lambda df: with_something(df, "crazy"))) > actual_df.show() > +----+---+--------+---------+ > |name|age|greeting|something| > +----+---+--------+---------+ > |jose|  1|      hi|    crazy| > |  li|  2|      hi|    crazy| > | liz|  3|      hi|    crazy| > +----+---+--------+---------+ > {code} > The only thing needed to accomplish this is the following simple method for > DataFrame: > {code:java} > from pyspark.sql.dataframe import DataFrame > def transform(self, f): > return f(self) > DataFrame.transform = transform > {code} > I volunteer to do the pull request if approved (at least the python part) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
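Since `transform` accepts only a unary function, the extra parameters Vincent asks about have to be bound up front. A Spark-free sketch of both workarounds (the `Frame` class is a hypothetical stand-in; the binding idiom itself is standard Python):

```python
from functools import partial

class Frame:
    """Hypothetical stand-in for a Spark DataFrame."""
    def __init__(self, rows):
        self.rows = rows

    def transform(self, f):
        # Mirrors the transform method above: f takes a frame, returns a frame.
        return f(self)

def with_something(frame, something):
    return Frame(frame.rows + [something])

# Bind the extra argument with a lambda...
a = Frame(["jose"]).transform(lambda df: with_something(df, "crazy"))
# ...or with functools.partial; both produce a unary function.
b = Frame(["jose"]).transform(partial(with_something, something="crazy"))
print(a.rows, b.rows)  # ['jose', 'crazy'] ['jose', 'crazy']
```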
[jira] [Created] (SPARK-30690) Expose the documentation of CalendarInterval in API documentation
Hyukjin Kwon created SPARK-30690: Summary: Expose the documentation of CalendarInterval in API documentation Key: SPARK-30690 URL: https://issues.apache.org/jira/browse/SPARK-30690 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon We should also expose it in the API documentation, since we marked it as an unstable API as of SPARK-30547 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Fix Version/s: 2.4.5 > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027239#comment-17027239 ] Dongjoon Hyun commented on SPARK-30065: --- This is backported to `branch-2.4` via [https://github.com/apache/spark/pull/27411] . > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30669) Introduce AdmissionControl API to Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-30669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-30669. - Fix Version/s: 3.0.0 Assignee: Burak Yavuz Resolution: Done Resolved by [https://github.com/apache/spark/pull/27380] > Introduce AdmissionControl API to Structured Streaming > -- > > Key: SPARK-30669 > URL: https://issues.apache.org/jira/browse/SPARK-30669 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > In Structured Streaming, we have the concept of Triggers. With a trigger like > Trigger.Once(), the semantics are to process all the data available to the > datasource in a single micro-batch. However, these semantics can be broken when > data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate > limit the amount of data read for that micro-batch. > We propose to add a new interface `SupportsAdmissionControl` and `ReadLimit`. > A ReadLimit defines how much data should be read in the next micro-batch. > `SupportsAdmissionControl` specifies that a source can rate limit its ingest > into the system. The source can tell the system what the user specified as a > read limit, and the system can enforce this limit within each micro-batch or > impose its own limit if the Trigger is Trigger.Once(), for example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
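The shape of the proposed API can be pictured with a small sketch. Note this is a hypothetical Python rendering for intuition only; the real `SupportsAdmissionControl` and `ReadLimit` interfaces described above are JVM-side, and the class/method names below are assumptions, not the actual Spark API:

```python
class ReadLimit:
    """How much data the source may read in the next micro-batch."""

class ReadAllAvailable(ReadLimit):
    """No limit: process everything available (e.g. Trigger.Once semantics)."""

class ReadMaxRows(ReadLimit):
    """Cap the next micro-batch at a fixed number of rows."""
    def __init__(self, max_rows):
        self.max_rows = max_rows

class SupportsAdmissionControl:
    """A source that can rate-limit its own ingest into the system."""
    def default_read_limit(self):
        # Sources without a user-configured limit read everything available.
        return ReadAllAvailable()

    def latest_offset(self, start_offset, limit):
        # Return the end offset for the next batch, honoring `limit`;
        # concrete sources (e.g. a Kafka source) would implement this.
        raise NotImplementedError
```

The point of the split is that the source reports the user's configured limit, while the engine decides whether to enforce it per micro-batch or override it for a trigger like Trigger.Once().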
[jira] [Assigned] (SPARK-30362) InputMetrics are not updated for DataSourceRDD V2
[ https://issues.apache.org/jira/browse/SPARK-30362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30362: --- Assignee: Sandeep Katta > InputMetrics are not updated for DataSourceRDD V2 > -- > > Key: SPARK-30362 > URL: https://issues.apache.org/jira/browse/SPARK-30362 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Attachments: inputMetrics.png > > > InputMetrics are not updated for DataSourceRDD -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30362) InputMetrics are not updated for DataSourceRDD V2
[ https://issues.apache.org/jira/browse/SPARK-30362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30362. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27021 [https://github.com/apache/spark/pull/27021] > InputMetrics are not updated for DataSourceRDD V2 > -- > > Key: SPARK-30362 > URL: https://issues.apache.org/jira/browse/SPARK-30362 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.0.0 > > Attachments: inputMetrics.png > > > InputMetrics are not updated for DataSourceRDD -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29665) refine the TableProvider interface
[ https://issues.apache.org/jira/browse/SPARK-29665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29665. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26868 [https://github.com/apache/spark/pull/26868] > refine the TableProvider interface > -- > > Key: SPARK-29665 > URL: https://issues.apache.org/jira/browse/SPARK-29665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive
[ https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027180#comment-17027180 ] Wenchen Fan commented on SPARK-26021: - FYI: This fix has been replaced by a better fix that retains the difference between -0.0 and 0.0, via https://github.com/apache/spark/pull/23388 > -0.0 and 0.0 not treated consistently, doesn't match Hive > - > > Key: SPARK-26021 > URL: https://issues.apache.org/jira/browse/SPARK-26021 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean R. Owen >Assignee: Alon Doron >Priority: Critical > Labels: correctness > Fix For: 3.0.0 > > > Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new > issue: > The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are > numerically identical but not the same double value: > In Hive, 0.0 and -0.0 are equal since > https://issues.apache.org/jira/browse/HIVE-11174. > That's not the case with Spark SQL, as "group by" (non-codegen) treats them > as different values. Since their hash is different they're put in different > buckets of UnsafeFixedWidthAggregationMap. > In addition there's an inconsistency when using the codegen, for example the > following unit test: > {code:java} > println(Seq(0.0d, 0.0d, > -0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,3] > {code:java} > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,1], [-0.0,2] > {code:java} > spark.conf.set("spark.sql.codegen.wholeStage", "false") > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,2], [-0.0,1] > Note that the only difference between the first 2 lines is the order of the > elements in the Seq. > This inconsistency results from different partitioning of the Seq and the > use of the generated fast hash map in the first, partial, aggregation.
> It looks like we need to add a specific check for -0.0 before hashing (both > in codegen and non-codegen modes) if we want to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
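The root cause described above is that -0.0 and 0.0 compare equal under IEEE 754 but carry different bit patterns, so any hash computed over the raw bits (as the UnsafeFixedWidthAggregationMap buckets are) treats them as distinct keys. A small Python illustration of the bit-level difference:

```python
import struct

# Numerically equal under IEEE 754 comparison semantics...
assert 0.0 == -0.0

# ...but the 8-byte big-endian representations differ in the sign bit,
# so a hash over the raw bits puts the two values in different buckets.
bits_pos = struct.pack(">d", 0.0)
bits_neg = struct.pack(">d", -0.0)
print(bits_pos.hex())  # 0000000000000000
print(bits_neg.hex())  # 8000000000000000
```

This is why the suggested fix normalizes -0.0 to 0.0 before hashing: equality and hashing must agree for grouping to be consistent.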
[jira] [Resolved] (SPARK-29438) Failed to get state store in stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-29438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-29438. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26162 [https://github.com/apache/spark/pull/26162] > Failed to get state store in stream-stream join > --- > > Key: SPARK-29438 > URL: https://issues.apache.org/jira/browse/SPARK-29438 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Genmao Yu >Assignee: Jungtaek Lim >Priority: Critical > Fix For: 3.0.0 > > > Now, Spark uses the `TaskPartitionId` to determine the StateStore path. > {code:java} > TaskPartitionId \ > StateStoreVersion --> StoreProviderId -> StateStore > StateStoreName/ > {code} > In Spark stages, the task partition id is determined by the number of tasks. > As said above, the StateStore file path depends on the task partition id. So if a > stream-stream join task partition id changes relative to the last batch, it will > read wrong StateStore data or fail because the StateStore data does not exist. In some > corner cases, this happens. The following is sample pseudocode: > {code:java} > val df3 = streamDf1.join(streamDf2) > val df5 = streamDf3.join(batchDf4) > val df = df3.union(df5) > df.writeStream...start() > {code} > A simplified DAG looks like this: > {code:java} > DataSourceV2Scan Scan Relation DataSourceV2Scan DataSourceV2Scan > (streamDf3)| (streamDf1)(streamDf2) > | | | | > Exchange(200) Exchange(200) Exchange(200) Exchange(200) > | | | | >SortSort | | > \ / \ / > \/ \ / > SortMergeJoinStreamingSymmetricHashJoin > \ / >\ / > \ / > Union > {code} > Stream-stream join task ids will range from 200 to 399, as they are in the same > stage as the `SortMergeJoin`. But when there is no new incoming data in > `streamDf3` in some batch, it will generate an empty LocalRelation, and then > the SortMergeJoin will be replaced with a BroadcastHashJoin.
In this case, > stream-stream join task ids will range from 1 to 200. Finally, it will derive the > wrong StateStore path from the TaskPartitionId, and fail with an error reading the > state store delta file. > {code:java} > LocalTableScan Scan Relation DataSourceV2Scan DataSourceV2Scan > | | | | > BroadcastExchange | Exchange(200) Exchange(200) > | | | | > | | | | > \/ \ / >\ /\ / > BroadcastHashJoin StreamingSymmetricHashJoin > \ / >\ / > \ / > Union > {code} > In my job, I disabled the auto BroadcastJoin feature (set > spark.sql.autoBroadcastJoinThreshold=-1) to work around this bug. We should > make the StateStore path deterministic and not dependent on TaskPartitionId. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
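The fix direction the report asks for, a StateStore location derived from the query plan position rather than the scheduler's task id, can be sketched as follows. This is a hedged illustration: the function name and exact path layout are assumptions for this sketch, not Spark's actual internals:

```python
def state_store_path(checkpoint_root, operator_id, partition_id, store_name):
    # Deterministic: the path depends only on the query's checkpoint location
    # and the operator/partition position in the plan. It never involves the
    # physical task id handed out by the scheduler, so a plan change (e.g.
    # SortMergeJoin replaced by BroadcastHashJoin) cannot shift which store
    # a given join partition reads.
    return f"{checkpoint_root}/state/{operator_id}/{partition_id}/{store_name}"

print(state_store_path("/ckpt", 0, 5, "default"))  # /ckpt/state/0/5/default
```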
[jira] [Assigned] (SPARK-29438) Failed to get state store in stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-29438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-29438: - Assignee: Jungtaek Lim > Failed to get state store in stream-stream join > --- > > Key: SPARK-29438 > URL: https://issues.apache.org/jira/browse/SPARK-29438 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Genmao Yu >Assignee: Jungtaek Lim >Priority: Critical > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30656) Support the "minPartitions" option in Kafka batch source and streaming source v1
[ https://issues.apache.org/jira/browse/SPARK-30656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-30656. -- Fix Version/s: 3.0.0 Resolution: Done > Support the "minPartitions" option in Kafka batch source and streaming source > v1 > > > Key: SPARK-30656 > URL: https://issues.apache.org/jira/browse/SPARK-30656 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.0.0 > > > Right now, the "minPartitions" option only works in the Kafka streaming source > v2. It would be great if we could support it in the batch source and streaming source v1 > (v1 is the fallback mode when a user hits a regression in v2) as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29890: -- Fix Version/s: 2.4.5 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling
[ https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30689: -- Description: Many people/companies will not be moving to Hadoop 3.1 or greater, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable. The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses. This will also help users who are running Hadoop 3.1.x or greater but don't have the resources configured or aren't running in an isolated environment. was: Many people/companies will not be moving to Hadoop 3.1 or greater, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable.
The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses. > Allow custom resource scheduling to work with YARN versions that don't > support custom resource scheduling > - > > Key: SPARK-30689 > URL: https://issues.apache.org/jira/browse/SPARK-30689 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Many people/companies will not be moving to Hadoop 3.1 or greater, which > supports custom resource scheduling for things like GPUs, any time soon, and > have requested support for it on older Hadoop 2.x versions. This also means > that they may not have isolation enabled, which is what the default behavior > relies on. > Right now the option is to write a custom discovery script to handle it on > their own. This is OK but has some limitations because the script runs as a > separate process; it is also just a shell script. > I think we can make this a lot more flexible by making the entire resource > discovery class pluggable. The default one would stay as is and call the > discovery script, but an advanced user who wanted to replace the entire thing > could implement a pluggable class in which they could write custom code for > how to discover resource addresses. > This will also help users who are running Hadoop 3.1.x or greater but > don't have the resources configured or aren't running in an isolated > environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
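The pluggable discovery class proposed in the description above might look like the following sketch. All names here (ResourceDiscovery, ScriptBasedDiscovery, StaticGpuDiscovery) are hypothetical illustrations of the proposal, not Spark's actual API:

```java
import java.util.List;
import java.util.Map;

// Hypothetical shape of the proposal: a pluggable discovery interface, with
// the current script-based behavior kept as the default implementation.
interface ResourceDiscovery {
    // Maps resource name -> discovered addresses, e.g. "gpu" -> ["0", "1"].
    Map<String, List<String>> discoverResources(Map<String, String> conf);
}

// Default: keep today's behavior and shell out to the discovery script.
class ScriptBasedDiscovery implements ResourceDiscovery {
    @Override
    public Map<String, List<String>> discoverResources(Map<String, String> conf) {
        // A real implementation would run the configured script as a separate
        // process and parse its output; elided in this sketch.
        return Map.of();
    }
}

// An advanced user replaces the whole mechanism with in-process custom code,
// e.g. for a YARN version without resource isolation.
class StaticGpuDiscovery implements ResourceDiscovery {
    @Override
    public Map<String, List<String>> discoverResources(Map<String, String> conf) {
        return Map.of("gpu", List.of("0", "1"));
    }
}
```

Because discovery runs in-process rather than as a separate shell script, a custom class could query device libraries directly or apply site-specific logic.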
[jira] [Updated] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling
[ https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30689: -- Description: Many people/companies will not be moving to Hadoop 3.1 or greater, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable. The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses. was: Many people/companies will not be moving to Hadoop 3.0, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable. The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses.
> Allow custom resource scheduling to work with YARN versions that don't > support custom resource scheduling > - > > Key: SPARK-30689 > URL: https://issues.apache.org/jira/browse/SPARK-30689 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Many people/companies will not be moving to Hadoop 3.1 or greater, which > supports custom resource scheduling for things like GPUs, any time soon, and > have requested support for it on older Hadoop 2.x versions. This also means > that they may not have isolation enabled, which is what the default behavior > relies on. > Right now the option is to write a custom discovery script to handle it on > their own. This is OK but has some limitations because the script runs as a > separate process; it is also just a shell script. > I think we can make this a lot more flexible by making the entire resource > discovery class pluggable. The default one would stay as is and call the > discovery script, but an advanced user who wanted to replace the entire thing > could implement a pluggable class in which they could write custom code for > how to discover resource addresses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling
[ https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30689: -- Issue Type: Improvement (was: Bug) > Allow custom resource scheduling to work with YARN versions that don't > support custom resource scheduling > - > > Key: SPARK-30689 > URL: https://issues.apache.org/jira/browse/SPARK-30689 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Many people/companies will not be moving to Hadoop 3.1 or greater, which > supports custom resource scheduling for things like GPUs, any time soon, and > have requested support for it on older Hadoop 2.x versions. This also means > that they may not have isolation enabled, which is what the default behavior > relies on. > Right now the option is to write a custom discovery script to handle it on > their own. This is OK but has some limitations because the script runs as a > separate process; it is also just a shell script. > I think we can make this a lot more flexible by making the entire resource > discovery class pluggable. The default one would stay as is and call the > discovery script, but an advanced user who wanted to replace the entire thing > could implement a pluggable class in which they could write custom code for > how to discover resource addresses. > This will also help users who are running Hadoop 3.1.x or greater but > don't have the resources configured or aren't running in an isolated > environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30689) Allow custom resource scheduling to work with Hadoop versions that don't support it
Thomas Graves created SPARK-30689: - Summary: Allow custom resource scheduling to work with Hadoop versions that don't support it Key: SPARK-30689 URL: https://issues.apache.org/jira/browse/SPARK-30689 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 3.0.0 Reporter: Thomas Graves Many people/companies will not be moving to Hadoop 3.0, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable. The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling
[ https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30689: -- Summary: Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling (was: Allow custom resource scheduling to work with Hadoop versions that don't support it) > Allow custom resource scheduling to work with YARN versions that don't > support custom resource scheduling > - > > Key: SPARK-30689 > URL: https://issues.apache.org/jira/browse/SPARK-30689 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Many people/companies will not be moving to Hadoop 3.0, which supports > custom resource scheduling for things like GPUs, any time soon, and have > requested support for it on older Hadoop 2.x versions. This also means that > they may not have isolation enabled, which is what the default behavior > relies on. > Right now the option is to write a custom discovery script to handle it on > their own. This is OK but has some limitations because the script runs as a > separate process; it is also just a shell script. > I think we can make this a lot more flexible by making the entire resource > discovery class pluggable. The default one would stay as is and call the > discovery script, but an advanced user who wanted to replace the entire thing > could implement a pluggable class in which they could write custom code for > how to discover resource addresses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
Rajkumar Singh created SPARK-30688: -- Summary: Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF Key: SPARK-30688 URL: https://issues.apache.org/jira/browse/SPARK-30688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Reporter: Rajkumar Singh
{code:java}
scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-------------------------+
|unix_timestamp(20201, ww)|
+-------------------------+
|                     null|
+-------------------------+

scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
+-------------------------+
|unix_timestamp(20202, ww)|
+-------------------------+
|               1578182400|
+-------------------------+
{code}
This seems to happen for leap years only. I dug deeper into it, and it seems that Spark uses java.text.SimpleDateFormat to parse the expression here [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
{code:java}
formatter.parse( t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
but the parse fails: SimpleDateFormat is unable to parse the date and throws an "Unparseable date" exception, which Spark handles silently, returning NULL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
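The silent fallback the reporter describes can be sketched in plain Java. This is a hedged illustration of the reported behavior (swallow the ParseException, return NULL), not Spark's actual implementation:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

class UnixTimestampSketch {
    // Sketch of the behavior described above: SimpleDateFormat throws an
    // "Unparseable date" ParseException, the exception is swallowed, and the
    // SQL result silently becomes NULL. Illustrative only, not Spark's code.
    static Long unixTimestamp(String value, SimpleDateFormat formatter) {
        try {
            return formatter.parse(value).getTime() / 1000L;
        } catch (ParseException e) {
            return null; // the parse failure is handled silently
        }
    }
}
```

With an input the pattern cannot parse, the method returns null instead of propagating the error, which is why the query above shows NULL rather than failing.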
[jira] [Created] (SPARK-30687) When reading from a file with a pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row
Bao Nguyen created SPARK-30687: -- Summary: When reading from a file with a pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row Key: SPARK-30687 URL: https://issues.apache.org/jira/browse/SPARK-30687 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Bao Nguyen When reading from a file with a pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row instead of setting only the value at that cell to null.
{code:java}
case class TestModel(
  num: Double,
  test: String,
  mac: String,
  value: Double
)

val schema = ScalaReflection.schemaFor[TestModel].dataType.asInstanceOf[StructType]

// here's the content of the file test.data
// 1~test~mac1~2
// 1.0~testdatarow2~mac2~non-numeric
// 2~test1~mac1~3

val ds = spark
  .read
  .schema(schema)
  .option("delimiter", "~")
  .csv("/test-data/test.data")

ds.show()

// the content of the data frame; the second row is all null:
// +----+-----+----+-----+
// | num| test| mac|value|
// +----+-----+----+-----+
// | 1.0| test|mac1|  2.0|
// |null| null|null| null|
// | 2.0|test1|mac1|  3.0|
// +----+-----+----+-----+

// should be:
// +----+------------+----+-----+
// | num|        test| mac|value|
// +----+------------+----+-----+
// | 1.0|        test|mac1|  2.0|
// | 1.0|testdatarow2|mac2| null|
// | 2.0|       test1|mac1|  3.0|
// +----+------------+----+-----+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
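The behavior the reporter expects, nulling out only the cell that fails conversion, can be illustrated with a small, Spark-independent sketch. It simplifies the row to numeric fields only and assumes the `~` delimiter from the report:

```java
import java.util.Arrays;

class PerCellParse {
    // Illustration of per-cell coercion: a field that fails to convert
    // becomes null on its own, leaving the rest of the row intact.
    // This is not Spark's CSV reader; it is a simplified sketch.
    static Double[] parseDoubles(String line, String delimiter) {
        return Arrays.stream(line.split(delimiter))
                .map(field -> {
                    try {
                        return Double.valueOf(field.trim());
                    } catch (NumberFormatException e) {
                        return (Double) null; // null only the offending cell
                    }
                })
                .toArray(Double[]::new);
    }
}
```

Applied to the second data line above, only the non-numeric field would become null while the other columns keep their parsed values.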
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Affects Version/s: 2.0.2 > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
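The AnalysisException above comes from resolving a bare column name against output attributes that contain duplicates: with two columns named col2 there is no unique match. A minimal illustration of that ambiguity check (not Spark's actual resolver):

```java
import java.util.ArrayList;
import java.util.List;

class NameResolver {
    // Returns the index of the uniquely matching column, or throws when the
    // name matches more than one column -- the situation behind the
    // "Reference 'col2' is ambiguous" error. Illustrative, not Spark's code.
    static int resolve(List<String> columns, String name) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < columns.size(); i++) {
            if (columns.get(i).equalsIgnoreCase(name)) {
                hits.add(i);
            }
        }
        if (hits.size() > 1) {
            throw new IllegalStateException(
                "Reference '" + name + "' is ambiguous");
        }
        if (hits.isEmpty()) {
            throw new IllegalStateException("Cannot resolve '" + name + "'");
        }
        return hits.get(0);
    }
}
```

A fix along the lines of operating on the output attributes directly, rather than re-resolving each name, avoids this ambiguity when no columns were explicitly specified.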
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Affects Version/s: 2.1.3 > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Affects Version/s: 2.2.3 2.3.4 2.4.4 > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29890: -- Affects Version/s: 2.0.2 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29890: -- Affects Version/s: 2.1.3 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29890: -- Affects Version/s: 2.2.3 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.3, 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Issue Type: Bug (was: Improvement) > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30638) add resources as parameter to the PluginContext
[ https://issues.apache.org/jira/browse/SPARK-30638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30638: -- Description: Add the allocated resources as parameters to the PluginContext so that any plugin in the driver or executor can use this information to initialize devices or otherwise make use of it. (was: Add the allocated resources and ResourceProfile as parameters to the PluginContext so that any plugin in the driver or executor can use this information to initialize devices or otherwise make use of it.) > add resources as parameter to the PluginContext > --- > > Key: SPARK-30638 > URL: https://issues.apache.org/jira/browse/SPARK-30638 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Add the allocated resources as parameters to the PluginContext so that any > plugin in the driver or executor can use this information to initialize > devices or otherwise make use of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30673) Test cases in HiveShowCreateTableSuite should create Hive table instead of Datasource table
[ https://issues.apache.org/jira/browse/SPARK-30673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-30673. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27393 [https://github.com/apache/spark/pull/27393] > Test cases in HiveShowCreateTableSuite should create Hive table instead of > Datasource table > --- > > Key: SPARK-30673 > URL: https://issues.apache.org/jira/browse/SPARK-30673 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.0.0 > > > Because Spark SQL now creates a data source table if no provider is specified > in the SQL command, some test cases in HiveShowCreateTableSuite don't create > a Hive table, but a data source table. > This is confusing and defeats the purpose of this test suite. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21823) ALTER TABLE table statements such as RENAME and CHANGE columns should raise error if there are any dependent constraints.
[ https://issues.apache.org/jira/browse/SPARK-21823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026944#comment-17026944 ] Sunitha Kambhampati commented on SPARK-21823: - This issue is part of the new informational RI support: it covers the changes that will be required to the 2 existing DDLs mentioned in the description. It depends on SPARK-21784, which adds the initial DDL support for RI constraints and was waiting on review from the community. > ALTER TABLE table statements such as RENAME and CHANGE columns should raise > error if there are any dependent constraints. > - > > Key: SPARK-21823 > URL: https://issues.apache.org/jira/browse/SPARK-21823 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Suresh Thalamati >Priority: Major > > The following ALTER TABLE DDL statements will impact the informational > constraints defined on a table: > {code:sql} > ALTER TABLE name RENAME TO new_name > ALTER TABLE name CHANGE column_name new_name new_type > {code} > Spark SQL should raise errors if there are informational constraints > defined on the columns affected by the ALTER and let the user drop the > constraints before proceeding with the DDL. In the future we can enhance the > ALTER to automatically fix up the constraint definition in the catalog when > possible, and not raise an error. > When Spark adds support for DROP/REPLACE of columns, those statements will > also impact informational constraints. > {code:sql} > ALTER TABLE name DROP [COLUMN] column_name > ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...]) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30662) ALS/MLP extend HasBlockSize
[ https://issues.apache.org/jira/browse/SPARK-30662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30662: Assignee: Huaxin Gao > ALS/MLP extend HasBlockSize > --- > > Key: SPARK-30662 > URL: https://issues.apache.org/jira/browse/SPARK-30662 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30662) ALS/MLP extend HasBlockSize
[ https://issues.apache.org/jira/browse/SPARK-30662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30662. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27389 [https://github.com/apache/spark/pull/27389] > ALS/MLP extend HasBlockSize > --- > > Key: SPARK-30662 > URL: https://issues.apache.org/jira/browse/SPARK-30662 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13
[ https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026936#comment-17026936 ] Sean R. Owen commented on SPARK-29292: -- Right, I agree. I can try updating my fork and pushing the commits to a branch, for evaluation on _2.12_ at least. I was holding off until the 2.13 support blockers were removed, so I could do it all at once. Happy to do it, though, if you want to play with it and/or just see the scope of the change. > Fix internal usages of mutable collection for Seq in 2.13 > - > > Key: SPARK-29292 > URL: https://issues.apache.org/jira/browse/SPARK-29292 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > > Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a > simpler subset. > In 2.13, a mutable collection can't be returned as a > {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that > still works on 2.12. > {code} > [ERROR] [Error] > /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467: > type mismatch; > found : Seq[String] (in scala.collection) > required: Seq[String] (in scala.collection.immutable) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error
[ https://issues.apache.org/jira/browse/SPARK-30686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026929#comment-17026929 ] Behroz Sikander commented on SPARK-30686: - 1. I don't know the exact steps; I tried to reproduce it but couldn't. 2. I don't understand the question. If you are asking about the response of the /api/v1/version endpoint, then yes, it also returns this. > Spark 2.4.4 metrics endpoint throwing error > --- > > Key: SPARK-30686 > URL: https://issues.apache.org/jira/browse/SPARK-30686 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Behroz Sikander >Priority: Major > > I am using Spark-standalone in HA mode with zookeeper. > Once the driver is up and running, whenever I try to access the metrics api > using the following URL > http://master_address/proxy/app-20200130041234-0123/api/v1/applications > I get the following exception. > It seems that the request never even reaches the spark code. It would be > helpful if somebody can help me. > {code:java} > HTTP ERROR 500 > Problem accessing /api/v1/applications. 
Reason: > Server Error > Caused by: > java.lang.NullPointerException: while trying to invoke the method > org.glassfish.jersey.servlet.WebComponent.service(java.net.URI, java.net.URI, > javax.servlet.http.HttpServletRequest, > javax.servlet.http.HttpServletResponse) of a null object loaded from field > org.glassfish.jersey.servlet.ServletContainer.webComponent of an object > loaded from local variable 'this' > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > 
org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:808) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13
[ https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026914#comment-17026914 ] Kousuke Saruta commented on SPARK-29292: I was about to start working on this, so I've confirmed it. I also think .toSeq has little impact: if mutable.Seq.toSeq is called, the references are copied, but the effect should be small. Still, running some benchmarks might be better. > Fix internal usages of mutable collection for Seq in 2.13 > - > > Key: SPARK-29292 > URL: https://issues.apache.org/jira/browse/SPARK-29292 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > > Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a > simpler subset. > In 2.13, a mutable collection can't be returned as a > {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that > still works on 2.12. > {code} > [ERROR] [Error] > /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467: > type mismatch; > found : Seq[String] (in scala.collection) > required: Seq[String] (in scala.collection.immutable) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
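The kind of fix being discussed on this ticket can be sketched in plain Scala (the method name and contents are illustrative, not taken from the Spark source):

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only. On Scala 2.13, Seq aliases scala.collection.immutable.Seq,
// so a mutable buffer can no longer be returned directly where Seq is expected;
// appending .toSeq compiles on both 2.12 and 2.13. On 2.13 it copies the element
// references into an immutable collection, which is the small overhead the
// benchmarking question above is about.
def executorIds(): Seq[String] = {
  val buf = ArrayBuffer[String]()
  buf += "exec-1"
  buf += "exec-2"
  buf.toSeq // without this, 2.13 reports the quoted "type mismatch" error
}
```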
[jira] [Resolved] (SPARK-30680) ResolvedNamespace does not require a namespace catalog
[ https://issues.apache.org/jira/browse/SPARK-30680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30680. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27403 [https://github.com/apache/spark/pull/27403] > ResolvedNamespace does not require a namespace catalog > -- > > Key: SPARK-30680 > URL: https://issues.apache.org/jira/browse/SPARK-30680 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30622) commands should return dummy statistics
[ https://issues.apache.org/jira/browse/SPARK-30622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30622. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27344 [https://github.com/apache/spark/pull/27344] > commands should return dummy statistics > --- > > Key: SPARK-30622 > URL: https://issues.apache.org/jira/browse/SPARK-30622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
[ https://issues.apache.org/jira/browse/SPARK-29639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29639: -- Target Version/s: (was: 2.4.4, 2.4.5) > Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch > > > Key: SPARK-29639 > URL: https://issues.apache.org/jira/browse/SPARK-29639 > Project: Spark > Issue Type: Bug > Components: Input/Output, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Abhinav Choudhury >Priority: Major > > We have been running a Spark structured job on production for more than a > week now. Put simply, it reads data from source Kafka topics (with 4 > partitions) and writes to another kafka topic. Everything has been running > fine until the job started failing with the following error: > > {noformat} > Driver stacktrace: > === Streaming Query === > Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId > = 613a21ad-86e3-4781-891b-17d92c18954a] > Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":9809564} > }} > Current Available Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":10509527} > }} > Current State: ACTIVE > Thread State: RUNNABLE > <-- Removed Logical plan --> > Some data may have been lost because they are not available in Kafka any > more; either the > data was aged out by Kafka or the topic may have been deleted before all the > data in the > topic was processed. 
If you don't want your streaming query to fail on such > cases, set the > source option "failOnDataLoss" to "false".{noformat} > Configuration: > {noformat} > Spark 2.4.0 > Spark-sql-kafka 0.10{noformat} > Looking at the Spark structured streaming query progress logs, it seems like > the endOffsets computed for the next batch was actually smaller than the > starting offset: > *Microbatch Trigger 1:* > {noformat} > 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:51.741Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "getEndOffset" : 0, > "setOffsetRange" : 9, > "triggerExecution" : 9 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > *Next micro batch trigger:* > {noformat} > 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:52.907Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "addBatch" : 350, > "getBatch" : 4, > "getEndOffset" : 0, > "queryPlanning" : 102, > "setOffsetRange" : 24, > "triggerExecution" : 1043, > 
"walCommit" : 349 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 9773098, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > Notice that for partition 3 of the kafka topic, the endOffsets are actually > smaller than the starting offsets! > Checked the HDFS checkpoint dir and the checkpointed offsets look fine and > point to the last committed offsets > Why is the end offset for a partition being computed to a smaller value?
[jira] [Commented] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error
[ https://issues.apache.org/jira/browse/SPARK-30686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026886#comment-17026886 ] Dongjoon Hyun commented on SPARK-30686: --- Hi, [~bsikander]. Thank you for reporting. # Could you give us a reproducible step? # Did you see this in the Spark versions, too? > Spark 2.4.4 metrics endpoint throwing error > --- > > Key: SPARK-30686 > URL: https://issues.apache.org/jira/browse/SPARK-30686 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Behroz Sikander >Priority: Major > > I am using Spark-standalone in HA mode with zookeeper. > Once the driver is up and running, whenever I try to access the metrics api > using the following URL > http://master_address/proxy/app-20200130041234-0123/api/v1/applications > I get the following exception. > It seems that the request never even reaches the spark code. It would be > helpful if somebody can help me. > {code:java} > HTTP ERROR 500 > Problem accessing /api/v1/applications. 
Reason: > Server Error > Caused by: > java.lang.NullPointerException: while trying to invoke the method > org.glassfish.jersey.servlet.WebComponent.service(java.net.URI, java.net.URI, > javax.servlet.http.HttpServletRequest, > javax.servlet.http.HttpServletResponse) of a null object loaded from field > org.glassfish.jersey.servlet.ServletContainer.webComponent of an object > loaded from local variable 'this' > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > 
org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:808) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30685) Support ANSI INSERT syntax
[ https://issues.apache.org/jira/browse/SPARK-30685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Knoll updated SPARK-30685: Issue Type: New Feature (was: Bug) > Support ANSI INSERT syntax > -- > > Key: SPARK-30685 > URL: https://issues.apache.org/jira/browse/SPARK-30685 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chris Knoll >Priority: Minor > > Related to the [ANSI SQL specification for insert > syntax](https://en.wikipedia.org/wiki/Insert_(SQL)), could the parser and > underlying engine support the syntax of: > {{INSERT INTO () select }} > I think I read somewhere that there's some underlying technical detail where > the columns selected for insert into Spark tables must match the order of > the table definition. But, if this is the case, isn't there a > place in the parser layer and execution layer where the parser can translate > something like: > {{insert into someTable (col1,col2) > select someCol1, someCol2 from otherTable}} > where someTable has 3 columns (col3,col2,col1) (note the order here), so the > query is rewritten and sent to the engine as: > {{insert into someTable > select null, someCol2, someCol1 from otherTable}} > Note, the reordering and adding of the null column was done based on some > table metadata on someTable, so it knew which columns from the INSERT() map > over to the columns from the select. > Is this possible? The inability to specify insert columns is preventing > Spark from being a supported platform for our project. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
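The rewrite the reporter describes can be sketched outside of Spark (all names here are hypothetical): given the target table's column order and the explicit insert column list, permute the select expressions and pad unmentioned columns with NULL.

```scala
// Hypothetical sketch of the proposed rewrite, not Spark code.
// tableCols:   target table's columns in definition order, e.g. (col3, col2, col1)
// insertCols:  columns named in INSERT INTO t (...), e.g. (col1, col2)
// selectExprs: select-list expressions, positionally paired with insertCols
def reorderForInsert(tableCols: Seq[String],
                     insertCols: Seq[String],
                     selectExprs: Seq[String]): Seq[String] = {
  val exprByCol = insertCols.zip(selectExprs).toMap
  // Columns not mentioned in the insert list are padded with null,
  // matching the ticket's example rewrite.
  tableCols.map(col => exprByCol.getOrElse(col, "null"))
}
```

For the ticket's example, `reorderForInsert(Seq("col3", "col2", "col1"), Seq("col1", "col2"), Seq("someCol1", "someCol2"))` yields the select list of the rewritten query, `select null, someCol2, someCol1`.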
[jira] [Assigned] (SPARK-30678) Eliminate warnings from deprecated BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-30678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30678: - Assignee: Maxim Gekk > Eliminate warnings from deprecated BisectingKMeansModel.computeCost > --- > > Key: SPARK-30678 > URL: https://issues.apache.org/jira/browse/SPARK-30678 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > The computeCost() method of the BisectingKMeansModel class has been > deprecated. That causes warnings while compiling BisectingKMeansSuite: > {code} > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:108: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) < 0.1) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:135: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) == summary.trainingCost) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:323: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. 
> [warn] model.computeCost(dataset) > [warn] ^ > [warn] three warnings found > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30678) Eliminate warnings from deprecated BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-30678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30678. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27401 [https://github.com/apache/spark/pull/27401] > Eliminate warnings from deprecated BisectingKMeansModel.computeCost > --- > > Key: SPARK-30678 > URL: https://issues.apache.org/jira/browse/SPARK-30678 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The computeCost() method of the BisectingKMeansModel class has been > deprecated. That causes warnings while compiling BisectingKMeansSuite: > {code} > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:108: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) < 0.1) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:135: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) == summary.trainingCost) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:323: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. 
You can also get the cost on the training > dataset in the summary. > [warn] model.computeCost(dataset) > [warn] ^ > [warn] three warnings found > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30659) LogisticRegression blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30659. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27374 [https://github.com/apache/spark/pull/27374] > LogisticRegression blockify input vectors > - > > Key: SPARK-30659 > URL: https://issues.apache.org/jira/browse/SPARK-30659 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error
[ https://issues.apache.org/jira/browse/SPARK-30686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026805#comment-17026805 ] Behroz Sikander commented on SPARK-30686: - Sometimes, I have also seen {code:java} Caused by: java.lang.NoClassDefFoundError: org/glassfish/jersey/internal/inject/AbstractBinder at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:863) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:529) at java.net.URLClassLoader.access$100(URLClassLoader.java:75) at java.net.URLClassLoader$1.run(URLClassLoader.java:430) at java.net.URLClassLoader$1.run(URLClassLoader.java:424) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:423) at java.lang.ClassLoader.loadClass(ClassLoader.java:490) at java.lang.ClassLoader.loadClass(ClassLoader.java:423) at org.glassfish.jersey.media.sse.SseFeature.configure(SseFeature.java:123) at org.glassfish.jersey.model.internal.CommonConfig.configureFeatures(CommonConfig.java:730) at org.glassfish.jersey.model.internal.CommonConfig.configureMetaProviders(CommonConfig.java:648) at org.glassfish.jersey.server.ResourceConfig.configureMetaProviders(ResourceConfig.java:829) at org.glassfish.jersey.server.ApplicationHandler.initialize(ApplicationHandler.java:453) at org.glassfish.jersey.server.ApplicationHandler.access$500(ApplicationHandler.java:184) at org.glassfish.jersey.server.ApplicationHandler$3.call(ApplicationHandler.java:350) at org.glassfish.jersey.server.ApplicationHandler$3.call(ApplicationHandler.java:347) at org.glassfish.jersey.internal.Errors.process(Errors.java:315) at org.glassfish.jersey.internal.Errors.process(Errors.java:297) at org.glassfish.jersey.internal.Errors.processWithException(Errors.java:255) at 
org.glassfish.jersey.server.ApplicationHandler.(ApplicationHandler.java:347) at org.glassfish.jersey.servlet.WebComponent.(WebComponent.java:392) at org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:177) at org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:369) at javax.servlet.GenericServlet.init(GenericServlet.java:244) at org.spark_project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:643) at org.spark_project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:499) at org.spark_project.jetty.servlet.ServletHolder.ensureInstance(ServletHolder.java:791) at org.spark_project.jetty.servlet.ServletHolder.prepare(ServletHolder.java:776) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:579) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.spark_project.jetty.server.Server.handle(Server.java:539) at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Thread.java:808) Caused by: java.lang.ClassNotFoundException: org.glassfish.jersey.internal.
[jira] [Created] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error
Behroz Sikander created SPARK-30686: --- Summary: Spark 2.4.4 metrics endpoint throwing error Key: SPARK-30686 URL: https://issues.apache.org/jira/browse/SPARK-30686 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Behroz Sikander I am using Spark standalone in HA mode with ZooKeeper. Once the driver is up and running, whenever I try to access the metrics API using the following URL http://master_address/proxy/app-20200130041234-0123/api/v1/applications I get the following exception. It seems that the request never even reaches the Spark code. It would be helpful if somebody could look into this. {code:java} HTTP ERROR 500 Problem accessing /api/v1/applications. Reason: Server Error Caused by: java.lang.NullPointerException: while trying to invoke the method org.glassfish.jersey.servlet.WebComponent.service(java.net.URI, java.net.URI, javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) of a null object loaded from field org.glassfish.jersey.servlet.ServletContainer.webComponent of an object loaded from local variable 'this' at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.spark_project.jetty.server.Server.handle(Server.java:539) at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Thread.java:808) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30685) Support ANSI INSERT syntax
Chris Knoll created SPARK-30685: --- Summary: Support ANSI INSERT syntax Key: SPARK-30685 URL: https://issues.apache.org/jira/browse/SPARK-30685 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Chris Knoll Related to the [ANSI SQL specification for insert syntax](https://en.wikipedia.org/wiki/Insert_(SQL)), could the parsing and underlying engine support the syntax of: {{INSERT INTO <table> (<column list>) SELECT <expressions>}} I think I read somewhere that there's some underlying technical detail where the columns inserted into Spark tables must have the selected columns match the order of the table definition. But if this is the case, isn't there a place in the parser layer and execution layer where the parser can translate something like: {{insert into someTable (col1,col2) select someCol1, someCol2 from otherTable}} where someTable has three columns (col3,col2,col1) (note the order here), so that the query is rewritten and sent to the engine as: {{insert into someTable select null, someCol2, someCol1 from otherTable}} Note, the reordering and adding of the null column would be done based on table metadata for someTable, so it knows which columns from the INSERT column list map over to the columns from the select. Is this possible? The inability to specify insert columns is preventing our project from supporting Spark as a platform. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
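The rewrite requested above is mechanical once the table's column order is known. Here is a minimal pure-Python sketch of just the column-mapping step; the helper name `build_insert_select` is hypothetical and not part of any Spark API:

```python
def build_insert_select(table_columns, insert_columns, select_exprs):
    """Map an explicit INSERT column list onto the table's definition order,
    filling columns not mentioned in the INSERT list with NULL.

    table_columns  -- column names in table-definition order
    insert_columns -- columns named in `INSERT INTO t (...)`
    select_exprs   -- expressions from the SELECT list, positionally paired
                      with insert_columns
    """
    if len(insert_columns) != len(select_exprs):
        raise ValueError("INSERT column list and SELECT list differ in length")
    mapping = dict(zip(insert_columns, select_exprs))
    # Emit one expression per table column, in table-definition order.
    return [mapping.get(col, "null") for col in table_columns]


# someTable is defined as (col3, col2, col1); the INSERT targets (col1, col2).
exprs = build_insert_select(
    ["col3", "col2", "col1"],
    ["col1", "col2"],
    ["someCol1", "someCol2"],
)
print(exprs)
# ['null', 'someCol2', 'someCol1']
```

This reproduces the `select null, someCol2, someCol1` rewrite from the example above.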
[jira] [Created] (SPARK-30684) Show the description of metrics for WholeStageCodegen in DAG viz
Kousuke Saruta created SPARK-30684: -- Summary: Show the description of metrics for WholeStageCodegen in DAG viz Key: SPARK-30684 URL: https://issues.apache.org/jira/browse/SPARK-30684 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta In the DAG viz for SQL, a WholeStageCodegen node shows some metrics like `33 ms (1 ms, 1 ms, 26 ms (stage 16 (attempt 0): task 172))` but users can't understand what they mean because there is no description of those metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30683) Validating sha512 files for releases does not comply with Apache Project documentation
[ https://issues.apache.org/jira/browse/SPARK-30683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026785#comment-17026785 ] Loki Coyote commented on SPARK-30683: - This command seems like it would work for conversion and validation on most linux distros: {code:bash} cat spark-2.4.4-bin-hadoop2.7.tgz.sha512 | tr '\n' ' '| tr -d ' ' | awk -F ':' '{ print $2 "\t" $1 }'| sha512sum -c - {code} > Validating sha512 files for releases does not comply with Apache Project > documentation > -- > > Key: SPARK-30683 > URL: https://issues.apache.org/jira/browse/SPARK-30683 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0, 2.4.0, 2.4.4 >Reporter: Loki Coyote >Priority: Minor > > Spark release artifacts currently use gpg generated sha512 files. These are > not in the same format as ones generated via sha512sum, and can not be > validated easily using sha512sum. As the [Apache Project > instructions|https://www.apache.org/info/verification.html] for how to > validate a release explicitly instructs the use of sha512sum compatible > tools, there is no documentation for successfully validating a downloaded > release using the sha512 file. > > This subject has been [brought up > before|https://www.mail-archive.com/dev@spark.apache.org/msg20140.html] and > an issue to change to using sha512sum compatible formats was closed as > WontFix with little discussion; even if switching isn't warranted, at the > least instructions for validating using the sha512 files should be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
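As a cross-platform complement to the shell pipeline in the comment above, the same normalisation can be sketched in pure Python. It assumes the gpg-style `file: HEX HEX ...` layout (uppercase hex broken into whitespace-separated groups); the function name is illustrative:

```python
def gpg_sha512_to_sha512sum(text):
    """Convert a gpg-style 'file: HEX HEX ...' digest file into the
    'hexdigest  file' line format expected by `sha512sum -c`."""
    filename, _, digest = text.partition(":")
    # Drop all whitespace (spaces and newlines between hex groups),
    # lowercase, and re-pair with the filename.
    hexdigest = "".join(digest.split()).lower()
    return "{}  {}".format(hexdigest, filename.strip())


# Shortened digest for illustration; a real SHA-512 is 128 hex characters.
line = "spark-2.4.4-bin-hadoop2.7.tgz: 2E3A5C 853B9F"
print(gpg_sha512_to_sha512sum(line))
# 2e3a5c853b9f  spark-2.4.4-bin-hadoop2.7.tgz
```

The output line can then be piped to `sha512sum -c -` on any platform where coreutils is available.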
[jira] [Updated] (SPARK-30683) Validating sha512 files for releases does not comply with Apache Project documentation
[ https://issues.apache.org/jira/browse/SPARK-30683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Jacobsen updated SPARK-30683: -- Description: Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|https://www.apache.org/info/verification.html] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been [brought up before|https://www.mail-archive.com/dev@spark.apache.org/msg20140.html] and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. was: Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|[https://www.apache.org/info/verification.html]] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been [brought up before|[https://www.mail-archive.com/dev@spark.apache.org/msg20140.html]] and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. 
> Validating sha512 files for releases does not comply with Apache Project > documentation > -- > > Key: SPARK-30683 > URL: https://issues.apache.org/jira/browse/SPARK-30683 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0, 2.4.0, 2.4.4 >Reporter: Nick Jacobsen >Priority: Minor > > Spark release artifacts currently use gpg generated sha512 files. These are > not in the same format as ones generated via sha512sum, and can not be > validated easily using sha512sum. As the [Apache Project > instructions|https://www.apache.org/info/verification.html] for how to > validate a release explicitly instructs the use of sha512sum compatible > tools, there is no documentation for successfully validating a downloaded > release using the sha512 file. > > This subject has been [brought up > before|https://www.mail-archive.com/dev@spark.apache.org/msg20140.html] and > an issue to change to using sha512sum compatible formats was closed as > WontFix with little discussion; even if switching isn't warranted, at the > least instructions for validating using the sha512 files should be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30683) Validating sha512 files for releases does not comply with Apache Project documentation
[ https://issues.apache.org/jira/browse/SPARK-30683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Jacobsen updated SPARK-30683: -- Description: Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|[https://www.apache.org/info/verification.html]] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been [brought up before|[https://www.mail-archive.com/dev@spark.apache.org/msg20140.html]] and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. was: Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|[https://www.apache.org/info/verification.html]] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been brought up before and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. 
> Validating sha512 files for releases does not comply with Apache Project > documentation > -- > > Key: SPARK-30683 > URL: https://issues.apache.org/jira/browse/SPARK-30683 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0, 2.4.0, 2.4.4 >Reporter: Nick Jacobsen >Priority: Minor > > Spark release artifacts currently use gpg generated sha512 files. These are > not in the same format as ones generated via sha512sum, and can not be > validated easily using sha512sum. As the [Apache Project > instructions|[https://www.apache.org/info/verification.html]] for how to > validate a release explicitly instructs the use of sha512sum compatible > tools, there is no documentation for successfully validating a downloaded > release using the sha512 file. > > This subject has been [brought up > before|[https://www.mail-archive.com/dev@spark.apache.org/msg20140.html]] and > an issue to change to using sha512sum compatible formats was closed as > WontFix with little discussion; even if switching isn't warranted, at the > least instructions for validating using the sha512 files should be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30683) Validating sha512 files for releases does not comply with Apache Project documentation
Nick Jacobsen created SPARK-30683: - Summary: Validating sha512 files for releases does not comply with Apache Project documentation Key: SPARK-30683 URL: https://issues.apache.org/jira/browse/SPARK-30683 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.4.4, 2.4.0, 2.3.0 Reporter: Nick Jacobsen Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|[https://www.apache.org/info/verification.html]] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been brought up before and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30681) Add higher order functions API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-30681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-30681: --- Description: As of 3.0.0 higher order functions are available in SQL and Scala, but not in PySpark, forcing Python users to invoke these through {{expr}}, {{selectExpr}} or {{sql}}. This is error prone and not well documented. Spark should provide {{pyspark.sql}} wrappers that accept plain Python functions (of course within limits of {{(*Column) -> Column}}) as arguments. {code:python} df.select(transform("values", lambda c: trim(upper(c)))) def increment_values(k: Column, v: Column) -> Column: return v + 1 df.select(transform_values("data", increment_values)) {code} was: As of 3.0.0 higher order functions are available in SQL and Scala, but not in PySpark, forcing Python users to invoke these through {{expr}}, {{selectExpr}} or {{sql}}. This is error prone and not well documented. Spark should provide {{pyspark.sql}} wrappers that accept plain Python functions (of course within limits of {{(*Column) -> Column}}) as arguments. > Add higher order functions API to PySpark > - > > Key: SPARK-30681 > URL: https://issues.apache.org/jira/browse/SPARK-30681 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > As of 3.0.0 higher order functions are available in SQL and Scala, but not in > PySpark, forcing Python users to invoke these through {{expr}}, > {{selectExpr}} or {{sql}}. > This is error prone and not well documented. Spark should provide > {{pyspark.sql}} wrappers that accept plain Python functions (of course within > limits of {{(*Column) -> Column}}) as arguments. 
> {code:python} > df.select(transform("values", lambda c: trim(upper(c)))) > def increment_values(k: Column, v: Column) -> Column: > return v + 1 > df.select(transform_values("data", increment_values)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
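Until such wrappers exist, the {{expr}} workaround mentioned above can be made a little less error prone by generating the SQL lambda text from Python. This is only string assembly, not the proposed native API; `hof_expr` is a hypothetical helper and not part of pyspark:

```python
def hof_expr(func_name, col_name, *lambda_params, body=None):
    """Build a higher-order-function SQL expression string, e.g.
    transform(values, x -> trim(upper(x))), for use with
    pyspark.sql.functions.expr."""
    # Multiple lambda parameters need parentheses: (k, v) -> ...
    params = (
        "({})".format(", ".join(lambda_params))
        if len(lambda_params) > 1
        else lambda_params[0]
    )
    return "{}({}, {} -> {})".format(func_name, col_name, params, body)


print(hof_expr("transform", "values", "x", body="trim(upper(x))"))
# transform(values, x -> trim(upper(x)))
print(hof_expr("transform_values", "data", "k", "v", body="v + 1"))
# transform_values(data, (k, v) -> v + 1)
```

The resulting string would be passed through `expr`, e.g. `df.select(expr(hof_expr("transform", "values", "x", body="trim(upper(x))")))`, assuming a Spark version (2.4+ for `transform`, 3.0 for `transform_values`) where the SQL function exists.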
[jira] [Created] (SPARK-30682) Add higher order functions API to SparkR
Maciej Szymkiewicz created SPARK-30682: -- Summary: Add higher order functions API to SparkR Key: SPARK-30682 URL: https://issues.apache.org/jira/browse/SPARK-30682 Project: Spark Issue Type: Improvement Components: SparkR, SQL Affects Versions: 3.0.0 Reporter: Maciej Szymkiewicz As of 3.0.0 higher order functions are available in SQL and Scala, but not in SparkR, forcing R users to invoke these through {{expr}}, {{selectExpr}} or {{sql}}. It would be great if Spark provided high-level wrappers that accept plain R functions operating on SQL expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30681) Add higher order functions API to PySpark
Maciej Szymkiewicz created SPARK-30681: -- Summary: Add higher order functions API to PySpark Key: SPARK-30681 URL: https://issues.apache.org/jira/browse/SPARK-30681 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Maciej Szymkiewicz As of 3.0.0 higher order functions are available in SQL and Scala, but not in PySpark, forcing Python users to invoke these through {{expr}}, {{selectExpr}} or {{sql}}. This is error prone and not well documented. Spark should provide {{pyspark.sql}} wrappers that accept plain Python functions (of course within limits of {{(*Column) -> Column}}) as arguments. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30680) ResolvedNamespace does not require a namespace catalog
Wenchen Fan created SPARK-30680: --- Summary: ResolvedNamespace does not require a namespace catalog Key: SPARK-30680 URL: https://issues.apache.org/jira/browse/SPARK-30680 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile
[ https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026612#comment-17026612 ] Abhishek Rao commented on SPARK-30619: -- Here are the 2 exceptions SLF4J: Error: A JNI error has occurred, please check your installation and try again Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) at java.lang.Class.privateGetMethodRecursive(Class.java:3048) at java.lang.Class.getMethod0(Class.java:3018) at java.lang.Class.getMethod(Class.java:1784) at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526) Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 
7 more Commons.collections Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/collections/map/ReferenceMap at org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:58) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:302) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:185) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257) at org.apache.spark.SparkContext.(SparkContext.scala:424) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at org.apache.spark.examples.JavaWordCount.main(JavaWordCount.java:28) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:850) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:925) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:934) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections.map.ReferenceMap at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ... 
19 more > org.slf4j.Logger and org.apache.commons.collections classes not built as part > of hadoop-provided profile > > > Key: SPARK-30619 > URL: https://issues.apache.org/jira/browse/SPARK-30619 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.2, 2.4.4 > Environment: Spark on kubernetes >Reporter: Abhishek Rao >Priority: Major > > We're using spark-2.4.4-bin-without-hadoop.tgz and executing Java Word count > (org.apache.spark.examples.JavaWordCount) example on local files. > But we're seeing that it is expecting org.slf4j.Logger and > org.apache.commons.collections classes to be available for executing this. > We expected the binary to work as it is for local files. Is there anything > which we're missing? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
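For the "without-hadoop" (hadoop-provided) builds, Spark's documented approach is to supply the provided dependencies through SPARK_DIST_CLASSPATH from a local Hadoop installation, whose jars bundle slf4j and commons-collections. A sketch of the relevant config fragment; `/path/to/hadoop` is illustrative:

```shell
# conf/spark-env.sh -- make the hadoop-provided build pick up Hadoop's
# jars (including slf4j and commons-collections) at runtime.
export SPARK_DIST_CLASSPATH="$(/path/to/hadoop/bin/hadoop classpath)"
```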
[jira] [Created] (SPARK-30679) REPLACE TABLE can omit the USING clause
Wenchen Fan created SPARK-30679: --- Summary: REPLACE TABLE can omit the USING clause Key: SPARK-30679 URL: https://issues.apache.org/jira/browse/SPARK-30679 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30674) Use python3 in dev/lint-python
[ https://issues.apache.org/jira/browse/SPARK-30674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30674. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27394 [https://github.com/apache/spark/pull/27394] > Use python3 in dev/lint-python > -- > > Key: SPARK-30674 > URL: https://issues.apache.org/jira/browse/SPARK-30674 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > `lint-python` fails at python2. We had better use python3 explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30674) Use python3 in dev/lint-python
[ https://issues.apache.org/jira/browse/SPARK-30674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30674: - Assignee: Dongjoon Hyun > Use python3 in dev/lint-python > -- > > Key: SPARK-30674 > URL: https://issues.apache.org/jira/browse/SPARK-30674 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > `lint-python` fails at python2. We had better use python3 explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30678) Eliminate warnings from deprecated BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-30678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30678: --- Component/s: (was: SQL) MLlib > Eliminate warnings from deprecated BisectingKMeansModel.computeCost > --- > > Key: SPARK-30678 > URL: https://issues.apache.org/jira/browse/SPARK-30678 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > The computeCost() method of the BisectingKMeansModel class has been > deprecated. That causes warnings while compiling BisectingKMeansSuite: > {code} > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:108: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) < 0.1) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:135: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) == summary.trainingCost) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:323: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. 
> [warn] model.computeCost(dataset) > [warn] ^ > [warn] three warnings found > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30678) Eliminate warnings from deprecated BisectingKMeansModel.computeCost
Maxim Gekk created SPARK-30678: -- Summary: Eliminate warnings from deprecated BisectingKMeansModel.computeCost Key: SPARK-30678 URL: https://issues.apache.org/jira/browse/SPARK-30678 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The computeCost() method of the BisectingKMeansModel class has been deprecated. That causes warnings while compiling BisectingKMeansSuite: {code} [warn] /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:108: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary. [warn] assert(model.computeCost(dataset) < 0.1) [warn] ^ [warn] /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:135: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary. [warn] assert(model.computeCost(dataset) == summary.trainingCost) [warn] ^ [warn] /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:323: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary. [warn] model.computeCost(dataset) [warn] ^ [warn] three warnings found {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-28897) Invalid usage of '*' in expression 'coalesce' error when executing dataframe.na.fill(0)
[ https://issues.apache.org/jira/browse/SPARK-28897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-28897. - > Invalid usage of '*' in expression 'coalesce' error when executing > dataframe.na.fill(0) > --- > > Key: SPARK-28897 > URL: https://issues.apache.org/jira/browse/SPARK-28897 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saurabh Santhosh >Priority: Major > > Getting the following error when trying to execute the given statements > > {code:java} > var df = spark.sql(s"select * from default.test_table") > df.na.fill(0) > {code} > This error happens when the following property is set > {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > Error : > {code:java} > org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression > 'coalesce'; at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1021) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:997) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) 
> at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:997) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList$1.apply(Analyzer.scala:982) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList$1.apply(Analyzer.scala:977) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList(Analyzer.scala:977) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:905) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:900) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOpera
[jira] [Resolved] (SPARK-28897) Invalid usage of '*' in expression 'coalesce' error when executing dataframe.na.fill(0)
[ https://issues.apache.org/jira/browse/SPARK-28897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28897. --- Resolution: Duplicate > Invalid usage of '*' in expression 'coalesce' error when executing > dataframe.na.fill(0) > ---
[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026554#comment-17026554 ] Gabor Somogyi commented on SPARK-12312: --- [~quarkonium] As an initial step, the PR I'm working on will introduce a Docker test for Postgres to see whether later DB versions break anything. The same could be done for all the other DBs to clean this area up... > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Priority: Minor > > When loading DataFrames from a JDBC data source with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to the lack of a Kerberos ticket or the ability to generate one. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environments where exposing simple > authentication access is not an option due to IT policy. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026545#comment-17026545 ] Sarth Frey commented on SPARK-30667: [https://github.com/apache/spark/pull/27395] > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code}
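The intended all-gather semantics can be made concrete off-Spark. Below is a minimal sketch using Python threads in place of barrier tasks; `run_tasks` and its names are hypothetical illustration code, not the proposed BarrierTaskContext API:

```python
# Conceptual sketch (not the Spark API): emulating the proposed
# all_gather() semantics locally. Each "task" contributes one value
# (e.g., an available port); after the barrier, every task observes
# all values, ordered by task ID.
import threading

def run_tasks(num_tasks, values):
    results = [None] * num_tasks    # one shared slot per task ID
    gathered = [None] * num_tasks   # what each task observed after the gather
    barrier = threading.Barrier(num_tasks)

    def task(task_id):
        results[task_id] = values[task_id]  # contribute this task's value
        barrier.wait()                      # block until every task has contributed
        gathered[task_id] = list(results)   # now all slots are filled, ordered by task ID

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gathered

ports = run_tasks(3, [5001, 5002, 5003])
# every task observed the same ordered list: [5001, 5002, 5003]
```

The barrier guarantees that no task reads the shared slots before every task has written its own, which is the same ordering guarantee the proposed `context.all_gather(port)` snippet relies on.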
[jira] [Updated] (SPARK-29670) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29670: -- Fix Version/s: (was: 3.0.0) > Make executor's bindAddress configurable > > > Key: SPARK-29670 > URL: https://issues.apache.org/jira/browse/SPARK-29670 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4 >Reporter: Nishchal Venkataramana >Priority: Major >
[jira] [Closed] (SPARK-29670) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-29670. - > Make executor's bindAddress configurable > > Fix For: 3.0.0 >
[jira] [Updated] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24203: -- Labels: (was: bulk-closed) > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Assignee: Nishchal Venkataramana >Priority: Major > Fix For: 3.0.0 > >
[jira] [Updated] (SPARK-30677) Spark Streaming Job stuck when Kinesis Shard is increased when the job is running
[ https://issues.apache.org/jira/browse/SPARK-30677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mullaivendhan Ariaputhri updated SPARK-30677: - Attachment: Cluster-Config-P1.JPG Instance-Config-P2.JPG Instance-Config-P1.JPG Cluster-Config-P2.JPG > Spark Streaming Job stuck when Kinesis Shard is increased when the job is > running > - > > Key: SPARK-30677 > URL: https://issues.apache.org/jira/browse/SPARK-30677 > Project: Spark > Issue Type: Bug > Components: Block Manager, Structured Streaming >Affects Versions: 2.4.3 >Reporter: Mullaivendhan Ariaputhri >Priority: Major > Attachments: Cluster-Config-P1.JPG, Cluster-Config-P2.JPG, > Instance-Config-P1.JPG, Instance-Config-P2.JPG > > > Spark job stopped processing when the number of shards is increased when the > job is already running. > We have observed the below exceptions. > > 2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - > Failed to write to write ahead log > 2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - > Failed to write to write ahead log > 2020-01-27 06:42:29 ERROR FileBasedWriteAheadLog_ReceivedBlockTracker:70 - > Failed to write to write ahead log after 3 failures > 2020-01-27 06:42:29 WARN BatchedWriteAheadLog:87 - BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1845 cap=1845],1580107349095,Future())) > java.io.IOException: Not supported > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSystem.java:588) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > 
org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-27 06:42:29 WARN ReceivedBlockTracker:87 - Exception thrown while > writing record: > BlockAdditionEvent(ReceivedBlockInfo(0,Some(36),Some(SequenceNumberRanges(SequenceNumberRange(XXX,shardId-0006,49603657998853972269624727295162770770442241924489281634,49603657998853972269624727295206292099948368574778703970,36))),WriteAheadLogBasedStoreResult(input-0-1580106915391,Some(36),FileBasedWriteAheadLogSegment(s3://XXX/spark/checkpoint/XX/XXX/receivedData/0/log-1580107349000-1580107409000,0,31769 > to the WriteAheadLog. 
> org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: Not supported > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSys
[jira] [Updated] (SPARK-30677) Spark Streaming Job stuck when Kinesis Shard is increased when the job is running
[ https://issues.apache.org/jira/browse/SPARK-30677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mullaivendhan Ariaputhri updated SPARK-30677: - Description: Spark job stopped processing when the number of shards is increased when the job is already running. We have observed the below exceptions. 2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - Failed to write to write ahead log 2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - Failed to write to write ahead log 2020-01-27 06:42:29 ERROR FileBasedWriteAheadLog_ReceivedBlockTracker:70 - Failed to write to write ahead log after 3 failures 2020-01-27 06:42:29 WARN BatchedWriteAheadLog:87 - BatchedWriteAheadLog Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 lim=1845 cap=1845],1580107349095,Future())) java.io.IOException: Not supported at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSystem.java:588) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) at org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) at 
org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) at java.lang.Thread.run(Thread.java:748) 2020-01-27 06:42:29 WARN ReceivedBlockTracker:87 - Exception thrown while writing record: BlockAdditionEvent(ReceivedBlockInfo(0,Some(36),Some(SequenceNumberRanges(SequenceNumberRange(XXX,shardId-0006,49603657998853972269624727295162770770442241924489281634,49603657998853972269624727295206292099948368574778703970,36))),WriteAheadLogBasedStoreResult(input-0-1580106915391,Some(36),FileBasedWriteAheadLogSegment(s3://XXX/spark/checkpoint/XX/XXX/receivedData/0/log-1580107349000-1580107409000,0,31769 to the WriteAheadLog. org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) at org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Not supported at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSystem.java:588) at 
org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50)
[jira] [Created] (SPARK-30677) Spark Streaming Job stuck when Kinesis Shard is increased when the job is running
Mullaivendhan Ariaputhri created SPARK-30677: Summary: Spark Streaming Job stuck when Kinesis Shard is increased when the job is running Key: SPARK-30677 URL: https://issues.apache.org/jira/browse/SPARK-30677 Project: Spark Issue Type: Bug Components: Block Manager, Structured Streaming Affects Versions: 2.4.3 Reporter: Mullaivendhan Ariaputhri
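The repeated `java.io.IOException: Not supported` in the traces above is thrown by `S3NativeFileSystem.append()`: S3 objects cannot be appended to, while the file-based write-ahead log writer tries to append. As a hedged sketch (a commonly suggested mitigation for write-ahead logs on object stores, not a confirmed fix for this ticket), Spark Streaming can be told to close WAL files after every write so it never appends:

```
# spark-defaults.conf sketch: avoid append() on S3-backed write-ahead logs
spark.streaming.driver.writeAheadLog.closeFileAfterWrite    true
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite  true
```

Alternatively, keeping the checkpoint directory on an append-capable filesystem such as HDFS rather than an `s3://` path sidesteps the append entirely.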
[jira] [Created] (SPARK-30676) Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double
Maxim Gekk created SPARK-30676: -- Summary: Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double Key: SPARK-30676 URL: https://issues.apache.org/jira/browse/SPARK-30676 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Maxim Gekk The constructors of java.lang.Integer and java.lang.Double have been deprecated already, see [https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html]. The following warnings are printed while compiling Spark: {code} 1. RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. 5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. {code} The ticket aims to replace the constructors with the valueOf methods, or possibly by other means.
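A minimal sketch of the replacement the ticket proposes, shown in plain Java for illustration (the flagged call sites are in Spark's Scala sources, where the same `Integer.valueOf` / `Double.valueOf` calls apply):

```java
public class ValueOfExample {
    public static void main(String[] args) {
        // Deprecated since Java 9: always allocates a fresh object.
        // Integer boxed = new Integer(42);

        // Preferred: valueOf may return a cached instance
        // (the Integer cache covers at least -128..127).
        Integer boxed = Integer.valueOf(42);
        Double d = Double.valueOf(1.5);

        System.out.println(boxed + " " + d);  // prints: 42 1.5
    }
}
```

Besides silencing the deprecation warnings, `valueOf` can avoid allocations for small values, which is why the Javadoc recommends it over the constructors.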