[jira] [Commented] (SPARK-30670) Pipes for PySpark
[ https://issues.apache.org/jira/browse/SPARK-30670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027244#comment-17027244 ] Hyukjin Kwon commented on SPARK-30670: -- Yes, it's intentional in order to match the Scala side. It is easy to work around - you can just pass the arguments directly to the function. > Pipes for PySpark > - > > Key: SPARK-30670 > URL: https://issues.apache.org/jira/browse/SPARK-30670 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Vincent >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > I would propose to add a `pipe` method to a Spark DataFrame. It allows for a > functional programming pattern, inspired by the tidyverse, that is > currently missing. The pandas community also recently adopted this pattern, > documented [here|https://tomaugspurger.github.io/method-chaining.html]. > This is the idea. Suppose you had this: > {code:java} > # file that has [user, date, timestamp, eventtype] > ddf = spark.read.parquet("") > w_user = Window().partitionBy("user") > w_user_date = Window().partitionBy("user", "date") > w_user_time = Window().partitionBy("user").orderBy("timestamp") > thres_sesstime = 60 * 15 > min_n_rows = 10 > min_n_sessions = 5 > clean_ddf = (ddf > .withColumn("delta", sf.col("timestamp") - sf.lag("timestamp").over(w_user_time)) > .withColumn("new_session", (sf.col("delta") > thres_sesstime).cast("integer")) > .withColumn("session", sf.sum(sf.col("new_session")).over(w_user_time)) > .drop("new_session") > .drop("delta") > .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user)) > .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user)) > .filter(sf.col("nrow_user") > min_n_rows) > .filter(sf.col("nrow_user_date") > min_n_sessions) > .drop("nrow_user") > .drop("nrow_user_date")) > {code} > The code works and it is somewhat clear.
We add a session to the dataframe > and then use it to remove outliers. The issue is that this chain of > commands can get quite long, so it might be better to turn it into > functions. > {code:java} > def add_session(dataf, session_threshold=60*15): > w_user = Window().partitionBy("user").orderBy("timestamp") > > return (dataf > .withColumn("delta", sf.col("timestamp") - > sf.lag("timestamp").over(w_user)) > .withColumn("new_session", (sf.col("delta") > session_threshold).cast("integer")) > .withColumn("session", sf.sum(sf.col("new_session")).over(w_user)) > .drop("new_session") > .drop("delta")) > def remove_outliers(dataf, min_n_rows=10, min_n_sessions=5): > w_user = Window().partitionBy("user") > > return (dataf > .withColumn("nrow_user", sf.count(sf.col("timestamp")).over(w_user)) > .withColumn("nrow_user_date", sf.approx_count_distinct(sf.col("date")).over(w_user)) > .filter(sf.col("nrow_user") > min_n_rows) > .filter(sf.col("nrow_user_date") > min_n_sessions) > .drop("nrow_user") > .drop("nrow_user_date")) > {code} > The issue does not lie in these functions. These functions are great! You can > unit test them, and they give you nice verbs that serve as an > abstraction. The issue is in how you now need to apply them. > {code:java} > remove_outliers(add_session(ddf, session_threshold=1000), min_n_rows=11) > {code} > It'd be much nicer to allow for this: > {code:java} > (ddf > .pipe(add_session, session_threshold=900) > .pipe(remove_outliers, min_n_rows=11)) > {code} > The cool thing about this is that you can very easily allow for method > chaining, and you also get a great way to separate high-level code from > low-level code. You still allow customization at a high level by exposing keyword > arguments, but you can easily find the lower-level code when debugging because > you've contained the details in their functions. > For code maintenance, I've relied on this pattern a lot personally.
But > so far, I've always monkey-patched Spark to be able to do this. > {code:java} > from pyspark.sql import DataFrame > def pipe(self, func, *args, **kwargs): > return func(self, *args, **kwargs) > DataFrame.pipe = pipe > {code} > Could I perhaps add these few lines of code to the codebase? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
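The monkey-patched `pipe` above is just function application with the dataframe threaded through as the first argument. A minimal, Spark-free sketch of the same pattern (the `Frame` class and the two verbs here are hypothetical, purely for illustration):

```python
class Frame:
    """Hypothetical stand-in for a Spark DataFrame, to illustrate the pattern."""
    def __init__(self, rows):
        self.rows = rows

    def pipe(self, func, *args, **kwargs):
        # Same shape as the proposed DataFrame.pipe: apply func to self,
        # forwarding any extra positional and keyword arguments.
        return func(self, *args, **kwargs)

def drop_short(frame, min_len=3):
    return Frame([r for r in frame.rows if len(r) >= min_len])

def upper(frame):
    return Frame([r.upper() for r in frame.rows])

result = (Frame(["hi", "spark", "jira"])
          .pipe(drop_short, min_len=4)
          .pipe(upper))
print(result.rows)  # ['SPARK', 'JIRA']
```

Each verb stays independently unit-testable, while the chain reads top to bottom with the tuning parameters exposed at the call site.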
[jira] [Resolved] (SPARK-30671) SparkSession emptyDataFrame should not create an RDD
[ https://issues.apache.org/jira/browse/SPARK-30671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30671. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27400 [https://github.com/apache/spark/pull/27400] > SparkSession emptyDataFrame should not create an RDD > > > Key: SPARK-30671 > URL: https://issues.apache.org/jira/browse/SPARK-30671 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.0.0 > > > SparkSession.emptyDataFrame uses an empty RDD in the background. This is a > bit of a pity because the optimizer can't recognize this as being empty and > it won't apply things like {{PropagateEmptyRelation}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30670) Pipes for PySpark
[ https://issues.apache.org/jira/browse/SPARK-30670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027242#comment-17027242 ] Vincent commented on SPARK-30670: - I just had a look, but transform does not allow for `*args` and `**kwargs`. Is there a reason for this? To me this feels like it is not feature-complete. > Pipes for PySpark > - > > Key: SPARK-30670 > URL: https://issues.apache.org/jira/browse/SPARK-30670 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Vincent >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26449) Missing Dataframe.transform API in Python API
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027241#comment-17027241 ] Vincent commented on SPARK-26449: - Is there a reason why transform does not accept `*args` and `**kwargs`? > Missing Dataframe.transform API in Python API > - > > Key: SPARK-26449 > URL: https://issues.apache.org/jira/browse/SPARK-26449 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Hanan Shteingart >Assignee: Erik Christiansen >Priority: Minor > Fix For: 3.0.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > I would like to chain custom transformations as suggested in this [blog > post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55] > This allows writing something like the following: > {code:java} > def with_greeting(df): > return df.withColumn("greeting", lit("hi")) > def with_something(df, something): > return df.withColumn("something", lit(something)) > data = [("jose", 1), ("li", 2), ("liz", 3)] > source_df = spark.createDataFrame(data, ["name", "age"]) > actual_df = (source_df > .transform(with_greeting) > .transform(lambda df: with_something(df, "crazy"))) > actual_df.show() > +----+---+--------+---------+ > |name|age|greeting|something| > +----+---+--------+---------+ > |jose|  1|      hi|    crazy| > |  li|  2|      hi|    crazy| > | liz|  3|      hi|    crazy| > +----+---+--------+---------+ > {code} > The only thing needed to accomplish this is the following simple method for > DataFrame: > {code:java} > from pyspark.sql.dataframe import DataFrame > def transform(self, f): > return f(self) > DataFrame.transform = transform > {code} > I volunteer to do the pull request if approved (at least the python part) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
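Since `transform` accepts only a unary function, the extra parameters Vincent asks about have to be bound up front. A Spark-free sketch of both workarounds (the `Frame` class is a hypothetical stand-in; the binding idiom itself is standard Python):

```python
from functools import partial

class Frame:
    """Hypothetical stand-in for a Spark DataFrame."""
    def __init__(self, rows):
        self.rows = rows

    def transform(self, f):
        # Mirrors the transform method above: f takes a frame, returns a frame.
        return f(self)

def with_something(frame, something):
    return Frame(frame.rows + [something])

# Bind the extra argument with a lambda...
a = Frame(["jose"]).transform(lambda df: with_something(df, "crazy"))
# ...or with functools.partial; both produce a unary function.
b = Frame(["jose"]).transform(partial(with_something, something="crazy"))
print(a.rows, b.rows)  # ['jose', 'crazy'] ['jose', 'crazy']
```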
[jira] [Created] (SPARK-30690) Expose the documentation of CalendarInterval in API documentation
Hyukjin Kwon created SPARK-30690: Summary: Expose the documentation of CalendarInterval in API documentation Key: SPARK-30690 URL: https://issues.apache.org/jira/browse/SPARK-30690 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon We should also expose it in the API documentation, since we marked it as an unstable API as of SPARK-30547 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Fix Version/s: 2.4.5 > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027239#comment-17027239 ] Dongjoon Hyun commented on SPARK-30065: --- This is backported to `branch-2.4` via [https://github.com/apache/spark/pull/27411] . > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30669) Introduce AdmissionControl API to Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-30669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-30669. - Fix Version/s: 3.0.0 Assignee: Burak Yavuz Resolution: Done Resolved by [https://github.com/apache/spark/pull/27380] > Introduce AdmissionControl API to Structured Streaming > -- > > Key: SPARK-30669 > URL: https://issues.apache.org/jira/browse/SPARK-30669 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > In Structured Streaming, we have the concept of Triggers. With a trigger like > Trigger.Once(), the semantics are to process all the data available to the > datasource in a single micro-batch. However, these semantics can be broken when > data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate > limit the amount of data read for that micro-batch. > We propose to add a new interface `SupportsAdmissionControl` and `ReadLimit`. > A ReadLimit defines how much data should be read in the next micro-batch. > `SupportsAdmissionControl` specifies that a source can rate limit its ingest > into the system. The source can tell the system what the user specified as a > read limit, and the system can enforce this limit within each micro-batch or > impose its own limit if the Trigger is Trigger.Once(), for example. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
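The shape of the proposed API can be pictured with a small sketch. Note this is a hypothetical Python rendering for intuition only; the real `SupportsAdmissionControl` and `ReadLimit` interfaces described above are JVM-side, and the class/method names below are assumptions, not the actual Spark API:

```python
class ReadLimit:
    """How much data the source may read in the next micro-batch."""

class ReadAllAvailable(ReadLimit):
    """No limit: process everything available (e.g. Trigger.Once semantics)."""

class ReadMaxRows(ReadLimit):
    """Cap the next micro-batch at a fixed number of rows."""
    def __init__(self, max_rows):
        self.max_rows = max_rows

class SupportsAdmissionControl:
    """A source that can rate-limit its own ingest into the system."""
    def default_read_limit(self):
        # Sources without a user-configured limit read everything available.
        return ReadAllAvailable()

    def latest_offset(self, start_offset, limit):
        # Return the end offset for the next batch, honoring `limit`;
        # concrete sources (e.g. a Kafka source) would implement this.
        raise NotImplementedError
```

The point of the split is that the source reports the user's configured limit, while the engine decides whether to enforce it per micro-batch or override it for a trigger like Trigger.Once().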
[jira] [Assigned] (SPARK-30362) InputMetrics are not updated for DataSourceRDD V2
[ https://issues.apache.org/jira/browse/SPARK-30362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30362: --- Assignee: Sandeep Katta > InputMetrics are not updated for DataSourceRDD V2 > -- > > Key: SPARK-30362 > URL: https://issues.apache.org/jira/browse/SPARK-30362 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Attachments: inputMetrics.png > > > InputMetrics are not updated for DataSourceRDD -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30362) InputMetrics are not updated for DataSourceRDD V2
[ https://issues.apache.org/jira/browse/SPARK-30362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30362. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27021 [https://github.com/apache/spark/pull/27021] > InputMetrics are not updated for DataSourceRDD V2 > -- > > Key: SPARK-30362 > URL: https://issues.apache.org/jira/browse/SPARK-30362 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.0.0 > > Attachments: inputMetrics.png > > > InputMetrics are not updated for DataSourceRDD -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29665) refine the TableProvider interface
[ https://issues.apache.org/jira/browse/SPARK-29665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29665. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26868 [https://github.com/apache/spark/pull/26868] > refine the TableProvider interface > -- > > Key: SPARK-29665 > URL: https://issues.apache.org/jira/browse/SPARK-29665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive
[ https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027180#comment-17027180 ] Wenchen Fan commented on SPARK-26021: - FYI: This fix has been replaced by a better fix that retains the difference between -0.0 and 0.0, via https://github.com/apache/spark/pull/23388 > -0.0 and 0.0 not treated consistently, doesn't match Hive > - > > Key: SPARK-26021 > URL: https://issues.apache.org/jira/browse/SPARK-26021 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean R. Owen >Assignee: Alon Doron >Priority: Critical > Labels: correctness > Fix For: 3.0.0 > > > Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new > issue: > The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are > numerically identical but not the same double value: > In Hive, 0.0 and -0.0 are equal since > https://issues.apache.org/jira/browse/HIVE-11174. > That's not the case with Spark SQL, as "group by" (non-codegen) treats them > as different values. Since their hash is different they're put in different > buckets of UnsafeFixedWidthAggregationMap. > In addition there's an inconsistency when using the codegen, for example the > following unit test: > {code:java} > println(Seq(0.0d, 0.0d, > -0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,3] > {code:java} > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,1], [-0.0,2] > {code:java} > spark.conf.set("spark.sql.codegen.wholeStage", "false") > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,2], [-0.0,1] > Note that the only difference between the first 2 lines is the order of the > elements in the Seq. > This inconsistency results from different partitioning of the Seq and the > use of the generated fast hash map in the first, partial, aggregation.
> It looks like we need to add a specific check for -0.0 before hashing (both > in codegen and non-codegen modes) if we want to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
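The root cause described above is that -0.0 and 0.0 compare equal under IEEE 754 but carry different bit patterns, so any hash computed over the raw bits (as the UnsafeFixedWidthAggregationMap buckets are) treats them as distinct keys. A small Python illustration of the bit-level difference:

```python
import struct

# Numerically equal under IEEE 754 comparison semantics...
assert 0.0 == -0.0

# ...but the 8-byte big-endian representations differ in the sign bit,
# so a hash over the raw bits puts the two values in different buckets.
bits_pos = struct.pack(">d", 0.0)
bits_neg = struct.pack(">d", -0.0)
print(bits_pos.hex())  # 0000000000000000
print(bits_neg.hex())  # 8000000000000000
```

This is why the suggested fix normalizes -0.0 to 0.0 before hashing: equality and hashing must agree for grouping to be consistent.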
[jira] [Resolved] (SPARK-29438) Failed to get state store in stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-29438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-29438. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26162 [https://github.com/apache/spark/pull/26162] > Failed to get state store in stream-stream join > --- > > Key: SPARK-29438 > URL: https://issues.apache.org/jira/browse/SPARK-29438 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Genmao Yu >Assignee: Jungtaek Lim >Priority: Critical > Fix For: 3.0.0 > > > Now, Spark uses the `TaskPartitionId` to determine the StateStore path. > {code:java} > TaskPartitionId \ > StateStoreVersion --> StoreProviderId -> StateStore > StateStoreName/ > {code} > In Spark stages, the task partition id is determined by the number of tasks. > As said above, the StateStore file path depends on the task partition id. So if a > stream-stream join task partition id changes relative to the last batch, it will > read wrong StateStore data or fail because the StateStore data does not exist. In some > corner cases, this happens. The following is sample pseudocode: > {code:java} > val df3 = streamDf1.join(streamDf2) > val df5 = streamDf3.join(batchDf4) > val df = df3.union(df5) > df.writeStream...start() > {code} > A simplified DAG looks like this: > {code:java} > DataSourceV2Scan Scan Relation DataSourceV2Scan DataSourceV2Scan > (streamDf3)| (streamDf1)(streamDf2) > | | | | > Exchange(200) Exchange(200) Exchange(200) Exchange(200) > | | | | >SortSort | | > \ / \ / > \/ \ / > SortMergeJoinStreamingSymmetricHashJoin > \ / >\ / > \ / > Union > {code} > Stream-stream join task ids will range from 200 to 399, as they are in the same > stage as the `SortMergeJoin`. But when there is no new incoming data in > `streamDf3` in some batch, it will generate an empty LocalRelation, and then > the SortMergeJoin will be replaced with a BroadcastHashJoin.
In this case, > stream-stream join task ids will range from 1 to 200. Finally, it will derive the > wrong StateStore path from the TaskPartitionId, and fail with an error reading the > state store delta file. > {code:java} > LocalTableScan Scan Relation DataSourceV2Scan DataSourceV2Scan > | | | | > BroadcastExchange | Exchange(200) Exchange(200) > | | | | > | | | | > \/ \ / >\ /\ / > BroadcastHashJoin StreamingSymmetricHashJoin > \ / >\ / > \ / > Union > {code} > In my job, I disabled the auto BroadcastJoin feature (set > spark.sql.autoBroadcastJoinThreshold=-1) to work around this bug. We should > make the StateStore path deterministic and not dependent on TaskPartitionId. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
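The fix direction the report asks for, a StateStore location derived from the query plan position rather than the scheduler's task id, can be sketched as follows. This is a hedged illustration: the function name and exact path layout are assumptions for this sketch, not Spark's actual internals:

```python
def state_store_path(checkpoint_root, operator_id, partition_id, store_name):
    # Deterministic: the path depends only on the query's checkpoint location
    # and the operator/partition position in the plan. It never involves the
    # physical task id handed out by the scheduler, so a plan change (e.g.
    # SortMergeJoin replaced by BroadcastHashJoin) cannot shift which store
    # a given join partition reads.
    return f"{checkpoint_root}/state/{operator_id}/{partition_id}/{store_name}"

print(state_store_path("/ckpt", 0, 5, "default"))  # /ckpt/state/0/5/default
```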
[jira] [Assigned] (SPARK-29438) Failed to get state store in stream-stream join
[ https://issues.apache.org/jira/browse/SPARK-29438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-29438: - Assignee: Jungtaek Lim > Failed to get state store in stream-stream join > --- > > Key: SPARK-29438 > URL: https://issues.apache.org/jira/browse/SPARK-29438 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Genmao Yu >Assignee: Jungtaek Lim >Priority: Critical > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30656) Support the "minPartitions" option in Kafka batch source and streaming source v1
[ https://issues.apache.org/jira/browse/SPARK-30656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-30656. -- Fix Version/s: 3.0.0 Resolution: Done > Support the "minPartitions" option in Kafka batch source and streaming source > v1 > > > Key: SPARK-30656 > URL: https://issues.apache.org/jira/browse/SPARK-30656 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.0.0 > > > Right now, the "minPartitions" option only works in the Kafka streaming source > v2. It would be great if we could support it in the batch source and streaming source v1 > (v1 is the fallback mode when a user hits a regression in v2) as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29890: -- Fix Version/s: 2.4.5 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling
[ https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30689: -- Description: Many people/companies will not be moving to Hadoop 3.1 or greater, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable. The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses. This will also help users who are running Hadoop 3.1.x or greater but don't have the resources configured or aren't running in an isolated environment. was: Many people/companies will not be moving to Hadoop 3.1 or greater, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable.
The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses. > Allow custom resource scheduling to work with YARN versions that don't > support custom resource scheduling > - > > Key: SPARK-30689 > URL: https://issues.apache.org/jira/browse/SPARK-30689 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Many people/companies will not be moving to Hadoop 3.1 or greater, which > supports custom resource scheduling for things like GPUs, any time soon, and > have requested support for it on older Hadoop 2.x versions. This also means > that they may not have isolation enabled, which is what the default behavior > relies on. > Right now the option is to write a custom discovery script to handle it on > their own. This is OK but has some limitations because the script runs as a > separate process; it is also just a shell script. > I think we can make this a lot more flexible by making the entire resource > discovery class pluggable. The default one would stay as is and call the > discovery script, but an advanced user who wanted to replace the entire thing > could implement a pluggable class in which they could write custom code for > how to discover resource addresses. > This will also help users who are running Hadoop 3.1.x or greater but > don't have the resources configured or aren't running in an isolated > environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
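The pluggable discovery class proposed in the description above might look like the following sketch. All names here (ResourceDiscovery, ScriptBasedDiscovery, StaticGpuDiscovery) are hypothetical illustrations of the proposal, not Spark's actual API:

```java
import java.util.List;
import java.util.Map;

// Hypothetical shape of the proposal: a pluggable discovery interface, with
// the current script-based behavior kept as the default implementation.
interface ResourceDiscovery {
    // Maps resource name -> discovered addresses, e.g. "gpu" -> ["0", "1"].
    Map<String, List<String>> discoverResources(Map<String, String> conf);
}

// Default: keep today's behavior and shell out to the discovery script.
class ScriptBasedDiscovery implements ResourceDiscovery {
    @Override
    public Map<String, List<String>> discoverResources(Map<String, String> conf) {
        // A real implementation would run the configured script as a separate
        // process and parse its output; elided in this sketch.
        return Map.of();
    }
}

// An advanced user replaces the whole mechanism with in-process custom code,
// e.g. for a YARN version without resource isolation.
class StaticGpuDiscovery implements ResourceDiscovery {
    @Override
    public Map<String, List<String>> discoverResources(Map<String, String> conf) {
        return Map.of("gpu", List.of("0", "1"));
    }
}
```

Because discovery runs in-process rather than as a separate shell script, a custom class could query device libraries directly or apply site-specific logic.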
[jira] [Updated] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling
[ https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30689: -- Description: Many people/companies will not be moving to Hadoop 3.1 or greater, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable. The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses. was: Many people/companies will not be moving to Hadoop 3.0, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable. The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses.
> Allow custom resource scheduling to work with YARN versions that don't > support custom resource scheduling > - > > Key: SPARK-30689 > URL: https://issues.apache.org/jira/browse/SPARK-30689 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Many people/companies will not be moving to Hadoop 3.1 or greater, which > supports custom resource scheduling for things like GPUs, any time soon, and > have requested support for it on older Hadoop 2.x versions. This also means > that they may not have isolation enabled, which is what the default behavior > relies on. > Right now the option is to write a custom discovery script to handle it on > their own. This is OK but has some limitations because the script runs as a > separate process; it is also just a shell script. > I think we can make this a lot more flexible by making the entire resource > discovery class pluggable. The default one would stay as is and call the > discovery script, but an advanced user who wanted to replace the entire thing > could implement a pluggable class in which they could write custom code for > how to discover resource addresses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling
[ https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30689: -- Issue Type: Improvement (was: Bug) > Allow custom resource scheduling to work with YARN versions that don't > support custom resource scheduling > - > > Key: SPARK-30689 > URL: https://issues.apache.org/jira/browse/SPARK-30689 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Many people/companies will not be moving to Hadoop 3.1 or greater, which > supports custom resource scheduling for things like GPUs, any time soon, and > have requested support for it on older Hadoop 2.x versions. This also means > that they may not have isolation enabled, which is what the default behavior > relies on. > Right now the option is to write a custom discovery script to handle it on > their own. This is OK but has some limitations because the script runs as a > separate process; it is also just a shell script. > I think we can make this a lot more flexible by making the entire resource > discovery class pluggable. The default one would stay as is and call the > discovery script, but an advanced user who wanted to replace the entire thing > could implement a pluggable class in which they could write custom code for > how to discover resource addresses. > This will also help users who are running Hadoop 3.1.x or greater but > don't have the resources configured or aren't running in an isolated > environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30689) Allow custom resource scheduling to work with Hadoop versions that don't support it
Thomas Graves created SPARK-30689: - Summary: Allow custom resource scheduling to work with Hadoop versions that don't support it Key: SPARK-30689 URL: https://issues.apache.org/jira/browse/SPARK-30689 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 3.0.0 Reporter: Thomas Graves Many people/companies will not be moving to Hadoop 3.0, which supports custom resource scheduling for things like GPUs, any time soon, and have requested support for it on older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on. Right now the option is to write a custom discovery script to handle it on their own. This is OK but has some limitations because the script runs as a separate process; it is also just a shell script. I think we can make this a lot more flexible by making the entire resource discovery class pluggable. The default one would stay as is and call the discovery script, but an advanced user who wanted to replace the entire thing could implement a pluggable class in which they could write custom code for how to discover resource addresses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling
[ https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30689: -- Summary: Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling (was: Allow custom resource scheduling to work with Hadoop versions that don't support it) > Allow custom resource scheduling to work with YARN versions that don't > support custom resource scheduling > - > > Key: SPARK-30689 > URL: https://issues.apache.org/jira/browse/SPARK-30689 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Many people/companies will not be moving to Hadoop 3.0, which supports > custom resource scheduling for things like GPUs, any time soon, and have > requested support for it on older Hadoop 2.x versions. This also means that > they may not have isolation enabled, which is what the default behavior > relies on. > Right now the option is to write a custom discovery script to handle it on > their own. This is OK but has some limitations because the script runs as a > separate process; it is also just a shell script. > I think we can make this a lot more flexible by making the entire resource > discovery class pluggable. The default one would stay as is and call the > discovery script, but an advanced user who wanted to replace the entire thing > could implement a pluggable class in which they could write custom code for > how to discover resource addresses. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
Rajkumar Singh created SPARK-30688: -- Summary: Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF Key: SPARK-30688 URL: https://issues.apache.org/jira/browse/SPARK-30688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Reporter: Rajkumar Singh
{code:java}
scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-------------------------+
|unix_timestamp(20201, ww)|
+-------------------------+
|                     null|
+-------------------------+

scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
+-------------------------+
|unix_timestamp(20202, ww)|
+-------------------------+
|               1578182400|
+-------------------------+
{code}
This seems to happen for leap years only. I dug deeper into it, and it seems that Spark uses java.text.SimpleDateFormat to parse the expression here [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
{code:java}
formatter.parse( t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
but the parse fails: SimpleDateFormat is unable to parse the date and throws an "Unparseable date" exception, which Spark handles silently, returning NULL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
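The silent fallback the reporter describes can be sketched in plain Java. This is a hedged illustration of the reported behavior (swallow the ParseException, return NULL), not Spark's actual implementation:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

class UnixTimestampSketch {
    // Sketch of the behavior described above: SimpleDateFormat throws an
    // "Unparseable date" ParseException, the exception is swallowed, and the
    // SQL result silently becomes NULL. Illustrative only, not Spark's code.
    static Long unixTimestamp(String value, SimpleDateFormat formatter) {
        try {
            return formatter.parse(value).getTime() / 1000L;
        } catch (ParseException e) {
            return null; // the parse failure is handled silently
        }
    }
}
```

With an input the pattern cannot parse, the method returns null instead of propagating the error, which is why the query above shows NULL rather than failing.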
[jira] [Created] (SPARK-30687) When reading from a file with a pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row
Bao Nguyen created SPARK-30687: -- Summary: When reading from a file with a pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row Key: SPARK-30687 URL: https://issues.apache.org/jira/browse/SPARK-30687 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Bao Nguyen When reading from a file with a pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row instead of setting only the value at that cell to null.
{code:java}
case class TestModel(
  num: Double,
  test: String,
  mac: String,
  value: Double
)

val schema = ScalaReflection.schemaFor[TestModel].dataType.asInstanceOf[StructType]

// here's the content of the file test.data
// 1~test~mac1~2
// 1.0~testdatarow2~mac2~non-numeric
// 2~test1~mac1~3

val ds = spark
  .read
  .schema(schema)
  .option("delimiter", "~")
  .csv("/test-data/test.data")

ds.show()

// the content of the data frame; the second row is all null:
// +----+-----+----+-----+
// | num| test| mac|value|
// +----+-----+----+-----+
// | 1.0| test|mac1|  2.0|
// |null| null|null| null|
// | 2.0|test1|mac1|  3.0|
// +----+-----+----+-----+

// should be:
// +----+------------+----+-----+
// | num|        test| mac|value|
// +----+------------+----+-----+
// | 1.0|        test|mac1|  2.0|
// | 1.0|testdatarow2|mac2| null|
// | 2.0|       test1|mac1|  3.0|
// +----+------------+----+-----+
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
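The behavior the reporter expects, nulling out only the cell that fails conversion, can be illustrated with a small, Spark-independent sketch. It simplifies the row to numeric fields only and assumes the `~` delimiter from the report:

```java
import java.util.Arrays;

class PerCellParse {
    // Illustration of per-cell coercion: a field that fails to convert
    // becomes null on its own, leaving the rest of the row intact.
    // This is not Spark's CSV reader; it is a simplified sketch.
    static Double[] parseDoubles(String line, String delimiter) {
        return Arrays.stream(line.split(delimiter))
                .map(field -> {
                    try {
                        return Double.valueOf(field.trim());
                    } catch (NumberFormatException e) {
                        return (Double) null; // null only the offending cell
                    }
                })
                .toArray(Double[]::new);
    }
}
```

Applied to the second data line above, only the non-numeric field would become null while the other columns keep their parsed values.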
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Affects Version/s: 2.0.2 > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
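The AnalysisException above comes from resolving a bare column name against output attributes that contain duplicates: with two columns named col2 there is no unique match. A minimal illustration of that ambiguity check (not Spark's actual resolver):

```java
import java.util.ArrayList;
import java.util.List;

class NameResolver {
    // Returns the index of the uniquely matching column, or throws when the
    // name matches more than one column -- the situation behind the
    // "Reference 'col2' is ambiguous" error. Illustrative, not Spark's code.
    static int resolve(List<String> columns, String name) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < columns.size(); i++) {
            if (columns.get(i).equalsIgnoreCase(name)) {
                hits.add(i);
            }
        }
        if (hits.size() > 1) {
            throw new IllegalStateException(
                "Reference '" + name + "' is ambiguous");
        }
        if (hits.isEmpty()) {
            throw new IllegalStateException("Cannot resolve '" + name + "'");
        }
        return hits.get(0);
    }
}
```

A fix along the lines of operating on the output attributes directly, rather than re-resolving each name, avoids this ambiguity when no columns were explicitly specified.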
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Affects Version/s: 2.1.3 > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Affects Version/s: 2.2.3 2.3.4 2.4.4 > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29890: -- Affects Version/s: 2.0.2 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29890: -- Affects Version/s: 2.1.3 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29890) Unable to fill na with 0 with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29890: -- Affects Version/s: 2.2.3 > Unable to fill na with 0 with duplicate columns > --- > > Key: SPARK-29890 > URL: https://issues.apache.org/jira/browse/SPARK-29890 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.3, 2.3.3, 2.4.3 >Reporter: sandeshyapuram >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to fill out na values with 0. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > val parent = > spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc") > val c1 = parent.filter(lit(true)) > val c2 = parent.filter(lit(true)) > c1.join(c2, Seq("nums"), "left") > .na.fill(0).show{noformat} > {noformat} > 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: > error looking up the name of group 820818257: No such file or directory > org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could > be: abc, abc.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1246) > at > org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500) > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155) > at > org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134) > ... 54 elided{noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30065) Unable to drop na with duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-30065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30065: -- Issue Type: Bug (was: Improvement) > Unable to drop na with duplicate columns > > > Key: SPARK-30065 > URL: https://issues.apache.org/jira/browse/SPARK-30065 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Trying to drop rows with null values fails even when no columns are > specified. This should be allowed: > {code:java} > scala> val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2") > left: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2") > right: org.apache.spark.sql.DataFrame = [col1: string, col2: string] > scala> val df = left.join(right, Seq("col1")) > df: org.apache.spark.sql.DataFrame = [col1: string, col2: string ... 1 more > field] > scala> df.show > +----+----+----+ > |col1|col2|col2| > +----+----+----+ > |   1|null|   2| > |   3|   4|null| > +----+----+----+ > scala> df.na.drop("any") > org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could > be: col2, col2.; > at > org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30638) add resources as parameter to the PluginContext
[ https://issues.apache.org/jira/browse/SPARK-30638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-30638: -- Description: Add the allocated resources as parameters to the PluginContext so that any plugin in the driver or executor can use this information to initialize devices or otherwise make use of it. (was: Add the allocated resources and ResourceProfile as parameters to the PluginContext so that any plugin in the driver or executor can use this information to initialize devices or otherwise make use of it.) > add resources as parameter to the PluginContext > --- > > Key: SPARK-30638 > URL: https://issues.apache.org/jira/browse/SPARK-30638 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > Add the allocated resources as parameters to the PluginContext so that any > plugin in the driver or executor can use this information to initialize > devices or otherwise make use of it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30673) Test cases in HiveShowCreateTableSuite should create Hive table instead of Datasource table
[ https://issues.apache.org/jira/browse/SPARK-30673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-30673. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27393 [https://github.com/apache/spark/pull/27393] > Test cases in HiveShowCreateTableSuite should create Hive table instead of > Datasource table > --- > > Key: SPARK-30673 > URL: https://issues.apache.org/jira/browse/SPARK-30673 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 3.0.0 > > > Because Spark SQL now creates a data source table if no provider is specified > in the SQL command, some test cases in HiveShowCreateTableSuite don't create > a Hive table, but a data source table. > This is confusing and defeats the purpose of this test suite. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21823) ALTER TABLE table statements such as RENAME and CHANGE columns should raise error if there are any dependent constraints.
[ https://issues.apache.org/jira/browse/SPARK-21823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026944#comment-17026944 ] Sunitha Kambhampati commented on SPARK-21823: - This issue is part of the new informational RI support: it covers the changes that will be required to the 2 existing DDLs mentioned in the description. It depends on SPARK-21784, which adds the initial DDL support for RI constraints and was waiting on review from the community. > ALTER TABLE table statements such as RENAME and CHANGE columns should raise > error if there are any dependent constraints. > - > > Key: SPARK-21823 > URL: https://issues.apache.org/jira/browse/SPARK-21823 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Suresh Thalamati >Priority: Major > > The following ALTER TABLE DDL statements will impact the informational > constraints defined on a table: > {code:sql} > ALTER TABLE name RENAME TO new_name > ALTER TABLE name CHANGE column_name new_name new_type > {code} > Spark SQL should raise errors if there are informational constraints > defined on the columns affected by the ALTER and let the user drop the > constraints before proceeding with the DDL. In the future we can enhance the > ALTER to automatically fix up the constraint definition in the catalog when > possible, and not raise an error. > When Spark adds support for DROP/REPLACE of columns, those statements will > also impact informational constraints. > {code:sql} > ALTER TABLE name DROP [COLUMN] column_name > ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...]) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30662) ALS/MLP extend HasBlockSize
[ https://issues.apache.org/jira/browse/SPARK-30662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30662: Assignee: Huaxin Gao > ALS/MLP extend HasBlockSize > --- > > Key: SPARK-30662 > URL: https://issues.apache.org/jira/browse/SPARK-30662 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30662) ALS/MLP extend HasBlockSize
[ https://issues.apache.org/jira/browse/SPARK-30662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30662. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27389 [https://github.com/apache/spark/pull/27389] > ALS/MLP extend HasBlockSize > --- > > Key: SPARK-30662 > URL: https://issues.apache.org/jira/browse/SPARK-30662 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13
[ https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026936#comment-17026936 ] Sean R. Owen commented on SPARK-29292: -- Right, I agree. I can try updating my fork and pushing the commits to a branch, for evaluation on _2.12_ at least. I was holding off until the 2.13 support blockers were removed, so I could do it all at once. Happy to do it, though, if you want to play with it and/or just see the scope of the change. > Fix internal usages of mutable collection for Seq in 2.13 > - > > Key: SPARK-29292 > URL: https://issues.apache.org/jira/browse/SPARK-29292 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > > Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a > simpler subset. > In 2.13, a mutable collection can't be returned as a > {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that > still works on 2.12. > {code} > [ERROR] [Error] > /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467: > type mismatch; > found : Seq[String] (in scala.collection) > required: Seq[String] (in scala.collection.immutable) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error
[ https://issues.apache.org/jira/browse/SPARK-30686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026929#comment-17026929 ] Behroz Sikander commented on SPARK-30686: - 1. I don't know the exact steps; I tried to reproduce it but couldn't. 2. I don't understand the question. If you are asking about the response of the /api/v1/version endpoint, then yes, it also returns this. > Spark 2.4.4 metrics endpoint throwing error > --- > > Key: SPARK-30686 > URL: https://issues.apache.org/jira/browse/SPARK-30686 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Behroz Sikander >Priority: Major > > I am using Spark-standalone in HA mode with zookeeper. > Once the driver is up and running, whenever I try to access the metrics api > using the following URL > http://master_address/proxy/app-20200130041234-0123/api/v1/applications > I get the following exception. > It seems that the request never even reaches the spark code. It would be > helpful if somebody can help me. > {code:java} > HTTP ERROR 500 > Problem accessing /api/v1/applications. 
Reason: > Server Error > Caused by: > java.lang.NullPointerException: while trying to invoke the method > org.glassfish.jersey.servlet.WebComponent.service(java.net.URI, java.net.URI, > javax.servlet.http.HttpServletRequest, > javax.servlet.http.HttpServletResponse) of a null object loaded from field > org.glassfish.jersey.servlet.ServletContainer.webComponent of an object > loaded from local variable 'this' > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > 
org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:808) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29292) Fix internal usages of mutable collection for Seq in 2.13
[ https://issues.apache.org/jira/browse/SPARK-29292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026914#comment-17026914 ] Kousuke Saruta commented on SPARK-29292: I was about to start working on this, so I've confirmed it. I also think .toSeq has little impact: if mutable.Seq.toSeq is called, the references are copied, but the effect should be small. Still, running some benchmarks might be better. > Fix internal usages of mutable collection for Seq in 2.13 > - > > Key: SPARK-29292 > URL: https://issues.apache.org/jira/browse/SPARK-29292 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > > Kind of related to https://issues.apache.org/jira/browse/SPARK-27681, but a > simpler subset. > In 2.13, a mutable collection can't be returned as a > {{scala.collection.Seq}}. It's easy enough to call .toSeq on these as that > still works on 2.12. > {code} > [ERROR] [Error] > /Users/seanowen/Documents/spark_2.13/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:467: > type mismatch; > found : Seq[String] (in scala.collection) > required: Seq[String] (in scala.collection.immutable) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
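The kind of fix being discussed on this ticket can be sketched in plain Scala (the method name and contents are illustrative, not taken from the Spark source):

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative only. On Scala 2.13, Seq aliases scala.collection.immutable.Seq,
// so a mutable buffer can no longer be returned directly where Seq is expected;
// appending .toSeq compiles on both 2.12 and 2.13. On 2.13 it copies the element
// references into an immutable collection, which is the small overhead the
// benchmarking question above is about.
def executorIds(): Seq[String] = {
  val buf = ArrayBuffer[String]()
  buf += "exec-1"
  buf += "exec-2"
  buf.toSeq // without this, 2.13 reports the quoted "type mismatch" error
}
```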
[jira] [Resolved] (SPARK-30680) ResolvedNamespace does not require a namespace catalog
[ https://issues.apache.org/jira/browse/SPARK-30680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30680. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27403 [https://github.com/apache/spark/pull/27403] > ResolvedNamespace does not require a namespace catalog > -- > > Key: SPARK-30680 > URL: https://issues.apache.org/jira/browse/SPARK-30680 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30622) commands should return dummy statistics
[ https://issues.apache.org/jira/browse/SPARK-30622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30622. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27344 [https://github.com/apache/spark/pull/27344] > commands should return dummy statistics > --- > > Key: SPARK-30622 > URL: https://issues.apache.org/jira/browse/SPARK-30622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29639) Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch
[ https://issues.apache.org/jira/browse/SPARK-29639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29639: -- Target Version/s: (was: 2.4.4, 2.4.5) > Spark Kafka connector 0.10.0 generates incorrect end offsets for micro batch > > > Key: SPARK-29639 > URL: https://issues.apache.org/jira/browse/SPARK-29639 > Project: Spark > Issue Type: Bug > Components: Input/Output, Structured Streaming >Affects Versions: 2.4.0 >Reporter: Abhinav Choudhury >Priority: Major > > We have been running a Spark structured job on production for more than a > week now. Put simply, it reads data from source Kafka topics (with 4 > partitions) and writes to another kafka topic. Everything has been running > fine until the job started failing with the following error: > > {noformat} > Driver stacktrace: > === Streaming Query === > Identifier: MetricComputer [id = af26bab1-0d89-4766-934e-ad5752d6bb08, runId > = 613a21ad-86e3-4781-891b-17d92c18954a] > Current Committed Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":9809564} > }} > Current Available Offsets: {KafkaV2[Subscribe[kafka-topic-name]]: > {"kafka-topic-name": > {"2":10458347,"1":10460151,"3":10475678,"0":10509527} > }} > Current State: ACTIVE > Thread State: RUNNABLE > <-- Removed Logical plan --> > Some data may have been lost because they are not available in Kafka any > more; either the > data was aged out by Kafka or the topic may have been deleted before all the > data in the > topic was processed. 
If you don't want your streaming query to fail on such > cases, set the > source option "failOnDataLoss" to "false".{noformat} > Configuration: > {noformat} > Spark 2.4.0 > Spark-sql-kafka 0.10{noformat} > Looking at the Spark structured streaming query progress logs, it seems like > the endOffsets computed for the next batch was actually smaller than the > starting offset: > *Microbatch Trigger 1:* > {noformat} > 2019/10/26 23:53:51 INFO utils.Logging[26]: 2019-10-26 23:53:51.767 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:51.741Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "getEndOffset" : 0, > "setOffsetRange" : 9, > "triggerExecution" : 9 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > *Next micro batch trigger:* > {noformat} > 2019/10/26 23:53:53 INFO utils.Logging[26]: 2019-10-26 23:53:53.951 : ( : > Query { > "id" : "99fe6c51-9f4a-4f6f-92d3-3c336ef5e06b", > "runId" : "2d20d633-2768-446c-845b-893243361422", > "name" : "StreamingProcessorName", > "timestamp" : "2019-10-26T23:53:52.907Z", > "batchId" : 2145898, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0, > "durationMs" : { > "addBatch" : 350, > "getBatch" : 4, > "getEndOffset" : 0, > "queryPlanning" : 102, > "setOffsetRange" : 24, > "triggerExecution" : 1043, > 
"walCommit" : 349 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "KafkaV2[Subscribe[kafka-topic-name]]", > "startOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 10469196, > "0" : 10503762 > } > }, > "endOffset" : { > "kafka-topic-name" : { > "2" : 10452513, > "1" : 10454326, > "3" : 9773098, > "0" : 10503762 > } > }, > "numInputRows" : 0, > "inputRowsPerSecond" : 0.0, > "processedRowsPerSecond" : 0.0 > } ], > "sink" : { > "description" : "ForeachBatchSink" > } > } in progress{noformat} > Notice that for partition 3 of the kafka topic, the endOffsets are actually > smaller than the starting offsets! > Checked the HDFS checkpoint dir and the checkpointed offsets look fine and > point to the last committed offsets > Why is the end offset for a partition being computed to a smaller value?
[jira] [Commented] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error
[ https://issues.apache.org/jira/browse/SPARK-30686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026886#comment-17026886 ] Dongjoon Hyun commented on SPARK-30686: --- Hi, [~bsikander]. Thank you for reporting. # Could you give us a reproducible step? # Did you see this in the Spark versions, too? > Spark 2.4.4 metrics endpoint throwing error > --- > > Key: SPARK-30686 > URL: https://issues.apache.org/jira/browse/SPARK-30686 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Behroz Sikander >Priority: Major > > I am using Spark-standalone in HA mode with zookeeper. > Once the driver is up and running, whenever I try to access the metrics api > using the following URL > http://master_address/proxy/app-20200130041234-0123/api/v1/applications > I get the following exception. > It seems that the request never even reaches the spark code. It would be > helpful if somebody can help me. > {code:java} > HTTP ERROR 500 > Problem accessing /api/v1/applications. 
Reason: > Server Error > Caused by: > java.lang.NullPointerException: while trying to invoke the method > org.glassfish.jersey.servlet.WebComponent.service(java.net.URI, java.net.URI, > javax.servlet.http.HttpServletRequest, > javax.servlet.http.HttpServletResponse) of a null object loaded from field > org.glassfish.jersey.servlet.ServletContainer.webComponent of an object > loaded from local variable 'this' > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) > at > org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:539) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > 
org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:808) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30685) Support ANSI INSERT syntax
[ https://issues.apache.org/jira/browse/SPARK-30685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Knoll updated SPARK-30685: Issue Type: New Feature (was: Bug) > Support ANSI INSERT syntax > -- > > Key: SPARK-30685 > URL: https://issues.apache.org/jira/browse/SPARK-30685 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chris Knoll >Priority: Minor > > Related to the [ANSI SQL specification for insert > syntax](https://en.wikipedia.org/wiki/Insert_(SQL)), could the parser and > underlying engine support the syntax of: > {{INSERT INTO () select }} > I think I read somewhere that there's some underlying technical detail where > the columns selected for insert into Spark tables must match the order of > the table definition. But, if this is the case, isn't there a > place in the parser layer and execution layer where the parser can translate > something like: > {{insert into someTable (col1,col2) > select someCol1, someCol2 from otherTable}} > where someTable has 3 columns (col3,col2,col1) (note the order here), so the > query is rewritten and sent to the engine as: > {{insert into someTable > select null, someCol2, someCol1 from otherTable}} > Note, the reordering and adding of the null column was done based on some > table metadata on someTable, so it knew which columns from the INSERT() map > over to the columns from the select. > Is this possible? The inability to specify insert columns is preventing > Spark from being a supported platform for our project. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
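The rewrite the reporter describes can be sketched outside of Spark (all names here are hypothetical): given the target table's column order and the explicit insert column list, permute the select expressions and pad unmentioned columns with NULL.

```scala
// Hypothetical sketch of the proposed rewrite, not Spark code.
// tableCols:   target table's columns in definition order, e.g. (col3, col2, col1)
// insertCols:  columns named in INSERT INTO t (...), e.g. (col1, col2)
// selectExprs: select-list expressions, positionally paired with insertCols
def reorderForInsert(tableCols: Seq[String],
                     insertCols: Seq[String],
                     selectExprs: Seq[String]): Seq[String] = {
  val exprByCol = insertCols.zip(selectExprs).toMap
  // Columns not mentioned in the insert list are padded with null,
  // matching the ticket's example rewrite.
  tableCols.map(col => exprByCol.getOrElse(col, "null"))
}
```

For the ticket's example, `reorderForInsert(Seq("col3", "col2", "col1"), Seq("col1", "col2"), Seq("someCol1", "someCol2"))` yields the select list of the rewritten query, `select null, someCol2, someCol1`.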
[jira] [Assigned] (SPARK-30678) Eliminate warnings from deprecated BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-30678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30678: - Assignee: Maxim Gekk > Eliminate warnings from deprecated BisectingKMeansModel.computeCost > --- > > Key: SPARK-30678 > URL: https://issues.apache.org/jira/browse/SPARK-30678 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > The computeCost() method of the BisectingKMeansModel class has been > deprecated. That causes warnings while compiling BisectingKMeansSuite: > {code} > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:108: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) < 0.1) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:135: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) == summary.trainingCost) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:323: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. 
> [warn] model.computeCost(dataset) > [warn] ^ > [warn] three warnings found > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30678) Eliminate warnings from deprecated BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-30678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30678. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27401 [https://github.com/apache/spark/pull/27401] > Eliminate warnings from deprecated BisectingKMeansModel.computeCost > --- > > Key: SPARK-30678 > URL: https://issues.apache.org/jira/browse/SPARK-30678 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The computeCost() method of the BisectingKMeansModel class has been > deprecated. That causes warnings while compiling BisectingKMeansSuite: > {code} > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:108: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) < 0.1) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:135: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) == summary.trainingCost) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:323: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. 
You can also get the cost on the training > dataset in the summary. > [warn] model.computeCost(dataset) > [warn] ^ > [warn] three warnings found > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30659) LogisticRegression blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30659. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27374 [https://github.com/apache/spark/pull/27374] > LogisticRegression blockify input vectors > - > > Key: SPARK-30659 > URL: https://issues.apache.org/jira/browse/SPARK-30659 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error
[ https://issues.apache.org/jira/browse/SPARK-30686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026805#comment-17026805 ] Behroz Sikander commented on SPARK-30686: - Sometimes, I have also seen {code:java} Caused by: java.lang.NoClassDefFoundError: org/glassfish/jersey/internal/inject/AbstractBinder at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:863) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:529) at java.net.URLClassLoader.access$100(URLClassLoader.java:75) at java.net.URLClassLoader$1.run(URLClassLoader.java:430) at java.net.URLClassLoader$1.run(URLClassLoader.java:424) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:423) at java.lang.ClassLoader.loadClass(ClassLoader.java:490) at java.lang.ClassLoader.loadClass(ClassLoader.java:423) at org.glassfish.jersey.media.sse.SseFeature.configure(SseFeature.java:123) at org.glassfish.jersey.model.internal.CommonConfig.configureFeatures(CommonConfig.java:730) at org.glassfish.jersey.model.internal.CommonConfig.configureMetaProviders(CommonConfig.java:648) at org.glassfish.jersey.server.ResourceConfig.configureMetaProviders(ResourceConfig.java:829) at org.glassfish.jersey.server.ApplicationHandler.initialize(ApplicationHandler.java:453) at org.glassfish.jersey.server.ApplicationHandler.access$500(ApplicationHandler.java:184) at org.glassfish.jersey.server.ApplicationHandler$3.call(ApplicationHandler.java:350) at org.glassfish.jersey.server.ApplicationHandler$3.call(ApplicationHandler.java:347) at org.glassfish.jersey.internal.Errors.process(Errors.java:315) at org.glassfish.jersey.internal.Errors.process(Errors.java:297) at org.glassfish.jersey.internal.Errors.processWithException(Errors.java:255) at 
org.glassfish.jersey.server.ApplicationHandler.(ApplicationHandler.java:347) at org.glassfish.jersey.servlet.WebComponent.(WebComponent.java:392) at org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:177) at org.glassfish.jersey.servlet.ServletContainer.init(ServletContainer.java:369) at javax.servlet.GenericServlet.init(GenericServlet.java:244) at org.spark_project.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:643) at org.spark_project.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:499) at org.spark_project.jetty.servlet.ServletHolder.ensureInstance(ServletHolder.java:791) at org.spark_project.jetty.servlet.ServletHolder.prepare(ServletHolder.java:776) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:579) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.spark_project.jetty.server.Server.handle(Server.java:539) at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Thread.java:808) Caused by: java.lang.ClassNotFoundException: org.glassfish.jersey.internal.
[jira] [Created] (SPARK-30686) Spark 2.4.4 metrics endpoint throwing error
Behroz Sikander created SPARK-30686: --- Summary: Spark 2.4.4 metrics endpoint throwing error Key: SPARK-30686 URL: https://issues.apache.org/jira/browse/SPARK-30686 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Behroz Sikander I am using Spark standalone in HA mode with ZooKeeper. Once the driver is up and running, whenever I try to access the metrics API using the following URL http://master_address/proxy/app-20200130041234-0123/api/v1/applications I get the following exception. It seems that the request never even reaches the Spark code. It would be helpful if somebody could look into this. {code:java} HTTP ERROR 500 Problem accessing /api/v1/applications. Reason: Server Error Caused by: java.lang.NullPointerException: while trying to invoke the method org.glassfish.jersey.servlet.WebComponent.service(java.net.URI, java.net.URI, javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) of a null object loaded from field org.glassfish.jersey.servlet.ServletContainer.webComponent of an object loaded from local variable 'this' at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341) at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228) at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.spark_project.jetty.server.Server.handle(Server.java:539) at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Thread.java:808) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30685) Support ANSI INSERT syntax
Chris Knoll created SPARK-30685: --- Summary: Support ANSI INSERT syntax Key: SPARK-30685 URL: https://issues.apache.org/jira/browse/SPARK-30685 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Chris Knoll Related to the [ANSI SQL specification for insert syntax](https://en.wikipedia.org/wiki/Insert_(SQL)), could the parsing and underlying engine support the syntax of: {{INSERT INTO <table> (<column list>) SELECT <expressions>}} I think I read somewhere that there's some underlying technical detail where the columns inserted into Spark tables must have the selected columns match the order of the table definition. But if this is the case, isn't there a place in the parser layer and execution layer where the parser can translate something like: {{insert into someTable (col1,col2) select someCol1, someCol2 from otherTable}} where someTable has three columns (col3,col2,col1) (note the order here), so that the query is rewritten and sent to the engine as: {{insert into someTable select null, someCol2, someCol1 from otherTable}} Note, the reordering and adding of the null column would be done based on table metadata for someTable, so it knows which columns from the INSERT column list map over to the columns from the select. Is this possible? The inability to specify insert columns is preventing our project from supporting Spark as a platform. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
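The rewrite requested above is mechanical once the table's column order is known. Here is a minimal pure-Python sketch of just the column-mapping step; the helper name `build_insert_select` is hypothetical and not part of any Spark API:

```python
def build_insert_select(table_columns, insert_columns, select_exprs):
    """Map an explicit INSERT column list onto the table's definition order,
    filling columns not mentioned in the INSERT list with NULL.

    table_columns  -- column names in table-definition order
    insert_columns -- columns named in `INSERT INTO t (...)`
    select_exprs   -- expressions from the SELECT list, positionally paired
                      with insert_columns
    """
    if len(insert_columns) != len(select_exprs):
        raise ValueError("INSERT column list and SELECT list differ in length")
    mapping = dict(zip(insert_columns, select_exprs))
    # Emit one expression per table column, in table-definition order.
    return [mapping.get(col, "null") for col in table_columns]


# someTable is defined as (col3, col2, col1); the INSERT targets (col1, col2).
exprs = build_insert_select(
    ["col3", "col2", "col1"],
    ["col1", "col2"],
    ["someCol1", "someCol2"],
)
print(exprs)
# ['null', 'someCol2', 'someCol1']
```

This reproduces the `select null, someCol2, someCol1` rewrite from the example above.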
[jira] [Created] (SPARK-30684) Show the description of metrics for WholeStageCodegen in DAG viz
Kousuke Saruta created SPARK-30684: -- Summary: Show the description of metrics for WholeStageCodegen in DAG viz Key: SPARK-30684 URL: https://issues.apache.org/jira/browse/SPARK-30684 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta In the DAG viz for SQL, a WholeStageCodegen node shows some metrics like `33 ms (1 ms, 1 ms, 26 ms (stage 16 (attempt 0): task 172))` but users can't understand what they mean because there is no description of those metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30683) Validating sha512 files for releases does not comply with Apache Project documentation
[ https://issues.apache.org/jira/browse/SPARK-30683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026785#comment-17026785 ] Loki Coyote commented on SPARK-30683: - This command seems like it would work for conversion and validation on most linux distros: {code:bash} cat spark-2.4.4-bin-hadoop2.7.tgz.sha512 | tr '\n' ' '| tr -d ' ' | awk -F ':' '{ print $2 "\t" $1 }'| sha512sum -c - {code} > Validating sha512 files for releases does not comply with Apache Project > documentation > -- > > Key: SPARK-30683 > URL: https://issues.apache.org/jira/browse/SPARK-30683 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0, 2.4.0, 2.4.4 >Reporter: Loki Coyote >Priority: Minor > > Spark release artifacts currently use gpg generated sha512 files. These are > not in the same format as ones generated via sha512sum, and can not be > validated easily using sha512sum. As the [Apache Project > instructions|https://www.apache.org/info/verification.html] for how to > validate a release explicitly instructs the use of sha512sum compatible > tools, there is no documentation for successfully validating a downloaded > release using the sha512 file. > > This subject has been [brought up > before|https://www.mail-archive.com/dev@spark.apache.org/msg20140.html] and > an issue to change to using sha512sum compatible formats was closed as > WontFix with little discussion; even if switching isn't warranted, at the > least instructions for validating using the sha512 files should be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
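As a cross-platform complement to the shell pipeline in the comment above, the same normalisation can be sketched in pure Python. It assumes the gpg-style `file: HEX HEX ...` layout (uppercase hex broken into whitespace-separated groups); the function name is illustrative:

```python
def gpg_sha512_to_sha512sum(text):
    """Convert a gpg-style 'file: HEX HEX ...' digest file into the
    'hexdigest  file' line format expected by `sha512sum -c`."""
    filename, _, digest = text.partition(":")
    # Drop all whitespace (spaces and newlines between hex groups),
    # lowercase, and re-pair with the filename.
    hexdigest = "".join(digest.split()).lower()
    return "{}  {}".format(hexdigest, filename.strip())


# Shortened digest for illustration; a real SHA-512 is 128 hex characters.
line = "spark-2.4.4-bin-hadoop2.7.tgz: 2E3A5C 853B9F"
print(gpg_sha512_to_sha512sum(line))
# 2e3a5c853b9f  spark-2.4.4-bin-hadoop2.7.tgz
```

The output line can then be piped to `sha512sum -c -` on any platform where coreutils is available.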
[jira] [Updated] (SPARK-30683) Validating sha512 files for releases does not comply with Apache Project documentation
[ https://issues.apache.org/jira/browse/SPARK-30683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Jacobsen updated SPARK-30683: -- Description: Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|https://www.apache.org/info/verification.html] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been [brought up before|https://www.mail-archive.com/dev@spark.apache.org/msg20140.html] and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. was: Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|[https://www.apache.org/info/verification.html]] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been [brought up before|[https://www.mail-archive.com/dev@spark.apache.org/msg20140.html]] and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. 
> Validating sha512 files for releases does not comply with Apache Project > documentation > -- > > Key: SPARK-30683 > URL: https://issues.apache.org/jira/browse/SPARK-30683 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0, 2.4.0, 2.4.4 >Reporter: Nick Jacobsen >Priority: Minor > > Spark release artifacts currently use gpg generated sha512 files. These are > not in the same format as ones generated via sha512sum, and can not be > validated easily using sha512sum. As the [Apache Project > instructions|https://www.apache.org/info/verification.html] for how to > validate a release explicitly instructs the use of sha512sum compatible > tools, there is no documentation for successfully validating a downloaded > release using the sha512 file. > > This subject has been [brought up > before|https://www.mail-archive.com/dev@spark.apache.org/msg20140.html] and > an issue to change to using sha512sum compatible formats was closed as > WontFix with little discussion; even if switching isn't warranted, at the > least instructions for validating using the sha512 files should be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30683) Validating sha512 files for releases does not comply with Apache Project documentation
[ https://issues.apache.org/jira/browse/SPARK-30683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Jacobsen updated SPARK-30683: -- Description: Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|[https://www.apache.org/info/verification.html]] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been [brought up before|[https://www.mail-archive.com/dev@spark.apache.org/msg20140.html]] and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. was: Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|[https://www.apache.org/info/verification.html]] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been brought up before and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. 
> Validating sha512 files for releases does not comply with Apache Project > documentation > -- > > Key: SPARK-30683 > URL: https://issues.apache.org/jira/browse/SPARK-30683 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.3.0, 2.4.0, 2.4.4 >Reporter: Nick Jacobsen >Priority: Minor > > Spark release artifacts currently use gpg generated sha512 files. These are > not in the same format as ones generated via sha512sum, and can not be > validated easily using sha512sum. As the [Apache Project > instructions|[https://www.apache.org/info/verification.html]] for how to > validate a release explicitly instructs the use of sha512sum compatible > tools, there is no documentation for successfully validating a downloaded > release using the sha512 file. > > This subject has been [brought up > before|[https://www.mail-archive.com/dev@spark.apache.org/msg20140.html]] and > an issue to change to using sha512sum compatible formats was closed as > WontFix with little discussion; even if switching isn't warranted, at the > least instructions for validating using the sha512 files should be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30683) Validating sha512 files for releases does not comply with Apache Project documentation
Nick Jacobsen created SPARK-30683: - Summary: Validating sha512 files for releases does not comply with Apache Project documentation Key: SPARK-30683 URL: https://issues.apache.org/jira/browse/SPARK-30683 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.4.4, 2.4.0, 2.3.0 Reporter: Nick Jacobsen Spark release artifacts currently use gpg generated sha512 files. These are not in the same format as ones generated via sha512sum, and can not be validated easily using sha512sum. As the [Apache Project instructions|[https://www.apache.org/info/verification.html]] for how to validate a release explicitly instructs the use of sha512sum compatible tools, there is no documentation for successfully validating a downloaded release using the sha512 file. This subject has been brought up before and an issue to change to using sha512sum compatible formats was closed as WontFix with little discussion; even if switching isn't warranted, at the least instructions for validating using the sha512 files should be documented. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30681) Add higher order functions API to PySpark
[ https://issues.apache.org/jira/browse/SPARK-30681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-30681: --- Description: As of 3.0.0 higher order functions are available in SQL and Scala, but not in PySpark, forcing Python users to invoke these through {{expr}}, {{selectExpr}} or {{sql}}. This is error prone and not well documented. Spark should provide {{pyspark.sql}} wrappers that accept plain Python functions (of course within limits of {{(*Column) -> Column}}) as arguments. {code:python} df.select(transform("values", lambda c: trim(upper(c)))) def increment_values(k: Column, v: Column) -> Column: return v + 1 df.select(transform_values("data", increment_values)) {code} was: As of 3.0.0 higher order functions are available in SQL and Scala, but not in PySpark, forcing Python users to invoke these through {{expr}}, {{selectExpr}} or {{sql}}. This is error prone and not well documented. Spark should provide {{pyspark.sql}} wrappers that accept plain Python functions (of course within limits of {{(*Column) -> Column}}) as arguments. > Add higher order functions API to PySpark > - > > Key: SPARK-30681 > URL: https://issues.apache.org/jira/browse/SPARK-30681 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > As of 3.0.0 higher order functions are available in SQL and Scala, but not in > PySpark, forcing Python users to invoke these through {{expr}}, > {{selectExpr}} or {{sql}}. > This is error prone and not well documented. Spark should provide > {{pyspark.sql}} wrappers that accept plain Python functions (of course within > limits of {{(*Column) -> Column}}) as arguments. 
> {code:python} > df.select(transform("values", lambda c: trim(upper(c)))) > def increment_values(k: Column, v: Column) -> Column: > return v + 1 > df.select(transform_values("data", increment_values)) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
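Until such wrappers exist, the {{expr}} workaround mentioned above can be made a little less error prone by generating the SQL lambda text from Python. This is only string assembly, not the proposed native API; `hof_expr` is a hypothetical helper and not part of pyspark:

```python
def hof_expr(func_name, col_name, *lambda_params, body=None):
    """Build a higher-order-function SQL expression string, e.g.
    transform(values, x -> trim(upper(x))), for use with
    pyspark.sql.functions.expr."""
    # Multiple lambda parameters need parentheses: (k, v) -> ...
    params = (
        "({})".format(", ".join(lambda_params))
        if len(lambda_params) > 1
        else lambda_params[0]
    )
    return "{}({}, {} -> {})".format(func_name, col_name, params, body)


print(hof_expr("transform", "values", "x", body="trim(upper(x))"))
# transform(values, x -> trim(upper(x)))
print(hof_expr("transform_values", "data", "k", "v", body="v + 1"))
# transform_values(data, (k, v) -> v + 1)
```

The resulting string would be passed through `expr`, e.g. `df.select(expr(hof_expr("transform", "values", "x", body="trim(upper(x))")))`, assuming a Spark version (2.4+ for `transform`, 3.0 for `transform_values`) where the SQL function exists.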
[jira] [Created] (SPARK-30682) Add higher order functions API to SparkR
Maciej Szymkiewicz created SPARK-30682: -- Summary: Add higher order functions API to SparkR Key: SPARK-30682 URL: https://issues.apache.org/jira/browse/SPARK-30682 Project: Spark Issue Type: Improvement Components: SparkR, SQL Affects Versions: 3.0.0 Reporter: Maciej Szymkiewicz As of 3.0.0 higher order functions are available in SQL and Scala, but not in SparkR, forcing R users to invoke these through {{expr}}, {{selectExpr}} or {{sql}}. It would be great if Spark provided high-level wrappers that accept plain R functions operating on SQL expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30681) Add higher order functions API to PySpark
Maciej Szymkiewicz created SPARK-30681: -- Summary: Add higher order functions API to PySpark Key: SPARK-30681 URL: https://issues.apache.org/jira/browse/SPARK-30681 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Maciej Szymkiewicz As of 3.0.0 higher order functions are available in SQL and Scala, but not in PySpark, forcing Python users to invoke these through {{expr}}, {{selectExpr}} or {{sql}}. This is error prone and not well documented. Spark should provide {{pyspark.sql}} wrappers that accept plain Python functions (of course within limits of {{(*Column) -> Column}}) as arguments. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30680) ResolvedNamespace does not require a namespace catalog
Wenchen Fan created SPARK-30680: --- Summary: ResolvedNamespace does not require a namespace catalog Key: SPARK-30680 URL: https://issues.apache.org/jira/browse/SPARK-30680 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile
[ https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026612#comment-17026612 ] Abhishek Rao commented on SPARK-30619: -- Here are the 2 exceptions SLF4J: Error: A JNI error has occurred, please check your installation and try again Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) at java.lang.Class.privateGetMethodRecursive(Class.java:3048) at java.lang.Class.getMethod0(Class.java:3018) at java.lang.Class.getMethod(Class.java:1784) at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526) Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 
7 more Commons.collections Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/collections/map/ReferenceMap at org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:58) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:302) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:185) at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257) at org.apache.spark.SparkContext.(SparkContext.scala:424) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at org.apache.spark.examples.JavaWordCount.main(JavaWordCount.java:28) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:850) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:925) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:934) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections.map.ReferenceMap at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ... 
19 more > org.slf4j.Logger and org.apache.commons.collections classes not built as part > of hadoop-provided profile > > > Key: SPARK-30619 > URL: https://issues.apache.org/jira/browse/SPARK-30619 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.2, 2.4.4 > Environment: Spark on kubernetes >Reporter: Abhishek Rao >Priority: Major > > We're using spark-2.4.4-bin-without-hadoop.tgz and executing Java Word count > (org.apache.spark.examples.JavaWordCount) example on local files. > But we're seeing that it is expecting org.slf4j.Logger and > org.apache.commons.collections classes to be available for executing this. > We expected the binary to work as it is for local files. Is there anything > which we're missing? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
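For the "without-hadoop" (hadoop-provided) builds, Spark's documented approach is to supply the provided dependencies through SPARK_DIST_CLASSPATH from a local Hadoop installation, whose jars bundle slf4j and commons-collections. A sketch of the relevant config fragment; `/path/to/hadoop` is illustrative:

```shell
# conf/spark-env.sh -- make the hadoop-provided build pick up Hadoop's
# jars (including slf4j and commons-collections) at runtime.
export SPARK_DIST_CLASSPATH="$(/path/to/hadoop/bin/hadoop classpath)"
```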
[jira] [Created] (SPARK-30679) REPLACE TABLE can omit the USING clause
Wenchen Fan created SPARK-30679: --- Summary: REPLACE TABLE can omit the USING clause Key: SPARK-30679 URL: https://issues.apache.org/jira/browse/SPARK-30679 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30674) Use python3 in dev/lint-python
[ https://issues.apache.org/jira/browse/SPARK-30674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30674. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27394 [https://github.com/apache/spark/pull/27394] > Use python3 in dev/lint-python > -- > > Key: SPARK-30674 > URL: https://issues.apache.org/jira/browse/SPARK-30674 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > > `lint-python` fails at python2. We had better use python3 explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30674) Use python3 in dev/lint-python
[ https://issues.apache.org/jira/browse/SPARK-30674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30674: - Assignee: Dongjoon Hyun > Use python3 in dev/lint-python > -- > > Key: SPARK-30674 > URL: https://issues.apache.org/jira/browse/SPARK-30674 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > `lint-python` fails at python2. We had better use python3 explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30678) Eliminate warnings from deprecated BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-30678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30678: --- Component/s: (was: SQL) MLlib > Eliminate warnings from deprecated BisectingKMeansModel.computeCost > --- > > Key: SPARK-30678 > URL: https://issues.apache.org/jira/browse/SPARK-30678 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > > The computeCost() method of the BisectingKMeansModel class has been > deprecated. That causes warnings while compiling BisectingKMeansSuite: > {code} > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:108: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) < 0.1) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:135: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. > [warn] assert(model.computeCost(dataset) == summary.trainingCost) > [warn] ^ > [warn] > /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:323: > method computeCost in class BisectingKMeansModel is deprecated (since > 3.0.0): This method is deprecated and will be removed in future versions. Use > ClusteringEvaluator instead. You can also get the cost on the training > dataset in the summary. 
> [warn] model.computeCost(dataset) > [warn] ^ > [warn] three warnings found > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30678) Eliminate warnings from deprecated BisectingKMeansModel.computeCost
Maxim Gekk created SPARK-30678: -- Summary: Eliminate warnings from deprecated BisectingKMeansModel.computeCost Key: SPARK-30678 URL: https://issues.apache.org/jira/browse/SPARK-30678 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The computeCost() method of the BisectingKMeansModel class has been deprecated. That causes warnings while compiling BisectingKMeansSuite: {code} [warn] /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:108: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary. [warn] assert(model.computeCost(dataset) < 0.1) [warn] ^ [warn] /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:135: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary. [warn] assert(model.computeCost(dataset) == summary.trainingCost) [warn] ^ [warn] /Users/maxim/proj/kmeans-computeCost-warning/mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala:323: method computeCost in class BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated and will be removed in future versions. Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary. [warn] model.computeCost(dataset) [warn] ^ [warn] three warnings found {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-28897) Invalid usage of '*' in expression 'coalesce' error when executing dataframe.na.fill(0)
[ https://issues.apache.org/jira/browse/SPARK-28897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-28897. - > Invalid usage of '*' in expression 'coalesce' error when executing > dataframe.na.fill(0) > --- > > Key: SPARK-28897 > URL: https://issues.apache.org/jira/browse/SPARK-28897 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saurabh Santhosh >Priority: Major > > Getting the following error when trying to execute the given statements > > {code:java} > var df = spark.sql(s"select * from default.test_table") > df.na.fill(0) > {code} > This error happens when the following property is set > {code:java} > spark.sql("set spark.sql.parser.quotedRegexColumnNames=true") > {code} > Error : > {code:java} > org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression > 'coalesce'; at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:1021) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$expandStarExpression$1.applyOrElse(Analyzer.scala:997) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326) 
> at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.expandStarExpression(Analyzer.scala:997) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList$1.apply(Analyzer.scala:982) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList$1.apply(Analyzer.scala:977) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$buildExpandedProjectList(Analyzer.scala:977) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:905) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:900) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOpera
[jira] [Resolved] (SPARK-28897) Invalid usage of '*' in expression 'coalesce' error when executing dataframe.na.fill(0)
[ https://issues.apache.org/jira/browse/SPARK-28897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28897. --- Resolution: Duplicate > Invalid usage of '*' in expression 'coalesce' error when executing > dataframe.na.fill(0) > ---
[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026554#comment-17026554 ] Gabor Somogyi commented on SPARK-12312: --- [~quarkonium] As an initial step, the PR I'm working on will introduce a Docker test for Postgres to see whether later DB versions break anything. The same could be done for all the other DBs to clean this area up... > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Priority: Minor > > When loading DataFrames from a JDBC data source with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to the lack of a Kerberos ticket or the ability to generate one. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environments where exposing simple > authentication access is not an option due to IT policy. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30667) Support simple all gather in barrier task context
[ https://issues.apache.org/jira/browse/SPARK-30667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026545#comment-17026545 ] Sarth Frey commented on SPARK-30667: [https://github.com/apache/spark/pull/27395] > Support simple all gather in barrier task context > - > > Key: SPARK-30667 > URL: https://issues.apache.org/jira/browse/SPARK-30667 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Currently we offer task.barrier() to coordinate tasks in barrier mode. Tasks > can see all IP addresses from BarrierTaskContext. It would be simpler to > integrate with distributed frameworks like TensorFlow DistributionStrategy if > we provide an all gather that lets tasks share additional information with > others, e.g., an available port. > Note that with all gather, tasks share their IP addresses as well. > {code} > port = ... # get an available port > ports = context.all_gather(port) # get all available ports, ordered by task ID > ... # set up distributed training service > {code}
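The intended all-gather semantics can be made concrete off-Spark. Below is a minimal sketch using Python threads in place of barrier tasks; `run_tasks` and its names are hypothetical illustration code, not the proposed BarrierTaskContext API:

```python
# Conceptual sketch (not the Spark API): emulating the proposed
# all_gather() semantics locally. Each "task" contributes one value
# (e.g., an available port); after the barrier, every task observes
# all values, ordered by task ID.
import threading

def run_tasks(num_tasks, values):
    results = [None] * num_tasks    # one shared slot per task ID
    gathered = [None] * num_tasks   # what each task observed after the gather
    barrier = threading.Barrier(num_tasks)

    def task(task_id):
        results[task_id] = values[task_id]  # contribute this task's value
        barrier.wait()                      # block until every task has contributed
        gathered[task_id] = list(results)   # now all slots are filled, ordered by task ID

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gathered

ports = run_tasks(3, [5001, 5002, 5003])
# every task observed the same ordered list: [5001, 5002, 5003]
```

The barrier guarantees that no task reads the shared slots before every task has written its own, which is the same ordering guarantee the proposed `context.all_gather(port)` snippet relies on.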
[jira] [Updated] (SPARK-29670) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29670: -- Fix Version/s: (was: 3.0.0) > Make executor's bindAddress configurable > > > Key: SPARK-29670 > URL: https://issues.apache.org/jira/browse/SPARK-29670 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4 >Reporter: Nishchal Venkataramana >Priority: Major >
[jira] [Closed] (SPARK-29670) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-29670. - > Make executor's bindAddress configurable > > Fix For: 3.0.0 >
[jira] [Updated] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24203: -- Labels: (was: bulk-closed) > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Assignee: Nishchal Venkataramana >Priority: Major > Fix For: 3.0.0 > >
[jira] [Updated] (SPARK-30677) Spark Streaming Job stuck when Kinesis Shard is increased when the job is running
[ https://issues.apache.org/jira/browse/SPARK-30677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mullaivendhan Ariaputhri updated SPARK-30677: - Attachment: Cluster-Config-P1.JPG Instance-Config-P2.JPG Instance-Config-P1.JPG Cluster-Config-P2.JPG > Spark Streaming Job stuck when Kinesis Shard is increased when the job is > running > - > > Key: SPARK-30677 > URL: https://issues.apache.org/jira/browse/SPARK-30677 > Project: Spark > Issue Type: Bug > Components: Block Manager, Structured Streaming >Affects Versions: 2.4.3 >Reporter: Mullaivendhan Ariaputhri >Priority: Major > Attachments: Cluster-Config-P1.JPG, Cluster-Config-P2.JPG, > Instance-Config-P1.JPG, Instance-Config-P2.JPG > > > Spark job stopped processing when the number of shards is increased when the > job is already running. > We have observed the below exceptions. > > 2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - > Failed to write to write ahead log > 2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - > Failed to write to write ahead log > 2020-01-27 06:42:29 ERROR FileBasedWriteAheadLog_ReceivedBlockTracker:70 - > Failed to write to write ahead log after 3 failures > 2020-01-27 06:42:29 WARN BatchedWriteAheadLog:87 - BatchedWriteAheadLog > Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 > lim=1845 cap=1845],1580107349095,Future())) > java.io.IOException: Not supported > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSystem.java:588) > at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) > at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) > at > org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) > at > 
org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) > at > org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) > at java.lang.Thread.run(Thread.java:748) > 2020-01-27 06:42:29 WARN ReceivedBlockTracker:87 - Exception thrown while > writing record: > BlockAdditionEvent(ReceivedBlockInfo(0,Some(36),Some(SequenceNumberRanges(SequenceNumberRange(XXX,shardId-0006,49603657998853972269624727295162770770442241924489281634,49603657998853972269624727295206292099948368574778703970,36))),WriteAheadLogBasedStoreResult(input-0-1580106915391,Some(36),FileBasedWriteAheadLogSegment(s3://XXX/spark/checkpoint/XX/XXX/receivedData/0/log-1580107349000-1580107409000,0,31769 > to the WriteAheadLog. 
> org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) > at > org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.IOException: Not supported > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSys
[jira] [Updated] (SPARK-30677) Spark Streaming Job stuck when Kinesis Shard is increased when the job is running
[ https://issues.apache.org/jira/browse/SPARK-30677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mullaivendhan Ariaputhri updated SPARK-30677: - Description: Spark job stopped processing when the number of shards is increased when the job is already running. We have observed the below exceptions. 2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - Failed to write to write ahead log 2020-01-27 06:42:29 WARN FileBasedWriteAheadLog_ReceivedBlockTracker:66 - Failed to write to write ahead log 2020-01-27 06:42:29 ERROR FileBasedWriteAheadLog_ReceivedBlockTracker:70 - Failed to write to write ahead log after 3 failures 2020-01-27 06:42:29 WARN BatchedWriteAheadLog:87 - BatchedWriteAheadLog Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 lim=1845 cap=1845],1580107349095,Future())) java.io.IOException: Not supported at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSystem.java:588) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50) at org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175) at 
org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142) at java.lang.Thread.run(Thread.java:748) 2020-01-27 06:42:29 WARN ReceivedBlockTracker:87 - Exception thrown while writing record: BlockAdditionEvent(ReceivedBlockInfo(0,Some(36),Some(SequenceNumberRanges(SequenceNumberRange(XXX,shardId-0006,49603657998853972269624727295162770770442241924489281634,49603657998853972269624727295206292099948368574778703970,36))),WriteAheadLogBasedStoreResult(input-0-1580106915391,Some(36),FileBasedWriteAheadLogSegment(s3://XXX/spark/checkpoint/XX/XXX/receivedData/0/log-1580107349000-1580107409000,0,31769 to the WriteAheadLog. org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242) at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89) at org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Not supported at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.append(S3NativeFileSystem.java:588) at 
org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295) at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32) at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94) at org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50)
[jira] [Created] (SPARK-30677) Spark Streaming Job stuck when Kinesis Shard is increased when the job is running
Mullaivendhan Ariaputhri created SPARK-30677: Summary: Spark Streaming Job stuck when Kinesis Shard is increased when the job is running Key: SPARK-30677 URL: https://issues.apache.org/jira/browse/SPARK-30677 Project: Spark Issue Type: Bug Components: Block Manager, Structured Streaming Affects Versions: 2.4.3 Reporter: Mullaivendhan Ariaputhri
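The repeated `java.io.IOException: Not supported` in the traces above is thrown by `S3NativeFileSystem.append()`: S3 objects cannot be appended to, while the file-based write-ahead log writer tries to append. As a hedged sketch (a commonly suggested mitigation for write-ahead logs on object stores, not a confirmed fix for this ticket), Spark Streaming can be told to close WAL files after every write so it never appends:

```
# spark-defaults.conf sketch: avoid append() on S3-backed write-ahead logs
spark.streaming.driver.writeAheadLog.closeFileAfterWrite    true
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite  true
```

Alternatively, keeping the checkpoint directory on an append-capable filesystem such as HDFS rather than an `s3://` path sidesteps the append entirely.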
[jira] [Created] (SPARK-30676) Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double
Maxim Gekk created SPARK-30676: -- Summary: Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double Key: SPARK-30676 URL: https://issues.apache.org/jira/browse/SPARK-30676 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Maxim Gekk The constructors of java.lang.Integer and java.lang.Double have been deprecated already, see [https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html]. The following warnings are printed while compiling Spark: {code} 1. RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information. 4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. 5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is deprecated: see corresponding Javadoc for more information. {code} The ticket aims to replace the constructors with the valueOf methods, or possibly by other means.
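A minimal sketch of the replacement the ticket proposes, shown in plain Java for illustration (the flagged call sites are in Spark's Scala sources, where the same `Integer.valueOf` / `Double.valueOf` calls apply):

```java
public class ValueOfExample {
    public static void main(String[] args) {
        // Deprecated since Java 9: always allocates a fresh object.
        // Integer boxed = new Integer(42);

        // Preferred: valueOf may return a cached instance
        // (the Integer cache covers at least -128..127).
        Integer boxed = Integer.valueOf(42);
        Double d = Double.valueOf(1.5);

        System.out.println(boxed + " " + d);  // prints: 42 1.5
    }
}
```

Besides silencing the deprecation warnings, `valueOf` can avoid allocations for small values, which is why the Javadoc recommends it over the constructors.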