[jira] [Commented] (SPARK-27196) Beginning offset 115204574 is after the ending offset 115204516 for topic
[ https://issues.apache.org/jira/browse/SPARK-27196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224470#comment-17224470 ] Yu Wang commented on SPARK-27196: - https://issues.apache.org/jira/browse/KAFKA-8124 > Beginning offset 115204574 is after the ending offset 115204516 for topic > -- > > Key: SPARK-27196 > URL: https://issues.apache.org/jira/browse/SPARK-27196 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 > Environment: Spark : 2.3.0 > Sparks Kafka: spark-streaming-kafka-0-10_2.3.0 > Kafka Client: org.apache.kafka.kafka-clients: 0.11.0.1 >Reporter: Prasanna Talakanti >Priority: Major > > We are getting this issue in production and Sparks consumer dying because of > Off Set issue. > We observed the following error in Kafka Broker > -- > [2019-03-18 14:40:14,100] WARN Unable to reconnect to ZooKeeper service, > session 0x1692e9ff4410004 has expired (org.apache.zookeeper.ClientCnxn) > [2019-03-18 14:40:14,100] INFO Unable to reconnect to ZooKeeper service, > session 0x1692e9ff4410004 has expired, closing socket connection > (org.apache.zook > eeper.ClientCnxn) > --- > > Sparks Job died with the following error: > ERROR 2019-03-18 07:40:57,178 7924 org.apache.spark.executor.Executor > [Executor task launch worker for task 16] Exception in task 27.0 in stage 0.0 > (TID 16) > java.lang.AssertionError: assertion failed: Beginning offset 115204574 is > after the ending offset 115204516 for topic partition 37. You > either provided an invalid fromOffset, or the Kafka topic has been damaged > at scala.Predef$.assert(Predef.scala:170) > at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:175) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33315) simplify CaseWhen with EqualTo
[ https://issues.apache.org/jira/browse/SPARK-33315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224453#comment-17224453 ] Apache Spark commented on SPARK-33315: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30222 > simplify CaseWhen with EqualTo > -- > > Key: SPARK-33315 > URL: https://issues.apache.org/jira/browse/SPARK-33315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > create table t(a int, b int, c int) using parquet; > SELECT * > FROM (SELECT CASE > WHEN a = 100 THEN 1 > WHEN b > 1000 THEN 2 > WHEN c IS NOT NULL THEN 3 >END AS x > FROM t) tmp > WHERE x = 2 > {code} > {noformat} > Before this PR: > Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN > isnotnull(c#3) THEN 3 END = 2) > After this PR: > Filter (b#2 > 1000) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33315) simplify CaseWhen with EqualTo
[ https://issues.apache.org/jira/browse/SPARK-33315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33315: Assignee: (was: Apache Spark) > simplify CaseWhen with EqualTo > -- > > Key: SPARK-33315 > URL: https://issues.apache.org/jira/browse/SPARK-33315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > create table t(a int, b int, c int) using parquet; > SELECT * > FROM (SELECT CASE > WHEN a = 100 THEN 1 > WHEN b > 1000 THEN 2 > WHEN c IS NOT NULL THEN 3 >END AS x > FROM t) tmp > WHERE x = 2 > {code} > {noformat} > Before this PR: > Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN > isnotnull(c#3) THEN 3 END = 2) > After this PR: > Filter (b#2 > 1000) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33315) simplify CaseWhen with EqualTo
[ https://issues.apache.org/jira/browse/SPARK-33315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224452#comment-17224452 ] Apache Spark commented on SPARK-33315: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/30222 > simplify CaseWhen with EqualTo > -- > > Key: SPARK-33315 > URL: https://issues.apache.org/jira/browse/SPARK-33315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {code:sql} > create table t(a int, b int, c int) using parquet; > SELECT * > FROM (SELECT CASE > WHEN a = 100 THEN 1 > WHEN b > 1000 THEN 2 > WHEN c IS NOT NULL THEN 3 >END AS x > FROM t) tmp > WHERE x = 2 > {code} > {noformat} > Before this PR: > Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN > isnotnull(c#3) THEN 3 END = 2) > After this PR: > Filter (b#2 > 1000) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33315) simplify CaseWhen with EqualTo
[ https://issues.apache.org/jira/browse/SPARK-33315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33315: Assignee: Apache Spark > simplify CaseWhen with EqualTo > -- > > Key: SPARK-33315 > URL: https://issues.apache.org/jira/browse/SPARK-33315 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:sql} > create table t(a int, b int, c int) using parquet; > SELECT * > FROM (SELECT CASE > WHEN a = 100 THEN 1 > WHEN b > 1000 THEN 2 > WHEN c IS NOT NULL THEN 3 >END AS x > FROM t) tmp > WHERE x = 2 > {code} > {noformat} > Before this PR: > Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN > isnotnull(c#3) THEN 3 END = 2) > After this PR: > Filter (b#2 > 1000) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33315) simplify CaseWhen with EqualTo
Yuming Wang created SPARK-33315: --- Summary: simplify CaseWhen with EqualTo Key: SPARK-33315 URL: https://issues.apache.org/jira/browse/SPARK-33315 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang {code:sql} create table t(a int, b int, c int) using parquet; SELECT * FROM (SELECT CASE WHEN a = 100 THEN 1 WHEN b > 1000 THEN 2 WHEN c IS NOT NULL THEN 3 END AS x FROM t) tmp WHERE x = 2 {code} {noformat} Before this PR: Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN isnotnull(c#3) THEN 3 END = 2) After this PR: Filter (b#2 > 1000) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output
[ https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224446#comment-17224446 ] Jungtaek Lim commented on SPARK-33259: -- I'll also link to the first JIRA issue which the problem was pointed out earlier. > Joining 3 streams results in incorrect output > - > > Key: SPARK-33259 > URL: https://issues.apache.org/jira/browse/SPARK-33259 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.1 >Reporter: Michael >Priority: Critical > Labels: correctness > > I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN > B) INNER JOIN C) operation. Below you can see example code I [posted on > Stackoverflow|https://stackoverflow.com/questions/64503539/]... > I created a minimal example of "sessions", that have "start" and "end" events > and optionally some "metadata". > The script generates two outputs: {{sessionStartsWithMetadata}} result from > "start" events that are left-joined with the "metadata" events, based on > {{sessionId}}. A "left join" is used, since we like to get an output event > even when no corresponding metadata exists. > Additionally a DataFrame {{endedSessionsWithMetadata}} is created by joining > "end" events to the previously created DataFrame. Here an "inner join" is > used, since we only want some output when a session has ended for sure. > This code can be executed in {{spark-shell}}: > {code:scala} > import java.sql.Timestamp > import org.apache.spark.sql.execution.streaming.{MemoryStream, > StreamingQueryWrapper} > import org.apache.spark.sql.streaming.StreamingQuery > import org.apache.spark.sql.{DataFrame, SQLContext} > import org.apache.spark.sql.functions.{col, expr, lit} > import spark.implicits._ > implicit val sqlContext: SQLContext = spark.sqlContext > // Main data processing, regardless whether batch or stream processing > def process( > sessionStartEvents: DataFrame, > sessionOptionalMetadataEvents: DataFrame, > sessionEndEvents: DataFrame > ): (DataFrame, DataFrame) = { > val sessionStartsWithMetadata: DataFrame = sessionStartEvents > .join( > sessionOptionalMetadataEvents, > sessionStartEvents("sessionId") === > sessionOptionalMetadataEvents("sessionId") && > sessionStartEvents("sessionStartTimestamp").between( > > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL > 1 seconds")), > > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL > 1 seconds")) > ), > "left" // metadata is optional > ) > .select( > sessionStartEvents("sessionId"), > sessionStartEvents("sessionStartTimestamp"), > sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp") > ) > val endedSessionsWithMetadata = sessionStartsWithMetadata.join( > sessionEndEvents, > sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") > && > sessionStartsWithMetadata("sessionStartTimestamp").between( > sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 > seconds")), > sessionEndEvents("sessionEndTimestamp") > ) > ) > (sessionStartsWithMetadata, endedSessionsWithMetadata) > } > def streamProcessing( > sessionStartData: Seq[(Timestamp, Int)], > sessionOptionalMetadata: Seq[(Timestamp, Int)], > sessionEndData: Seq[(Timestamp, Int)] > ): (StreamingQuery, StreamingQuery) = { > val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionStartEventsStream.addData(sessionStartData) > val sessionStartEvents: DataFrame = sessionStartEventsStream > .toDS() > .toDF("sessionStartTimestamp", "sessionId") > .withWatermark("sessionStartTimestamp", "1 second") > val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata) > val sessionOptionalMetadataEvents: DataFrame = > sessionOptionalMetadataEventsStream > .toDS() > .toDF("sessionOptionalMetadataTimestamp", "sessionId") > .withWatermark("sessionOptionalMetadataTimestamp", "1 second") > val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = > MemoryStream[(Timestamp, Int)] > sessionEndEventsStream.addData(sessionEndData) > val sessionEndEvents: DataFrame = sessionEndEventsStream > .toDS() > .toDF("sessionEndTimestamp", "sessionId") > .withWatermark("sessionEndTimestamp", "1 second") > val (sessionStartsWithMetadata, endedSessionsWithMetadata) = > process(sessionStartEvents, sessionOptionalMetadataEvents, > sessionEndEvents) > val sessionStartsWithMet
[jira] [Commented] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs
[ https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224445#comment-17224445 ] Saheb Kanodia commented on SPARK-12567: --- [~wuhao] [~vectorijk] This seems like a very common feature in other systems and in-demand by enterprises. Can you please re-open this? > Add aes_encrypt and aes_decrypt UDFs > > > Key: SPARK-12567 > URL: https://issues.apache.org/jira/browse/SPARK-12567 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Kai Jiang >Assignee: Kai Jiang >Priority: Major > Labels: bulk-closed > > AES (Advanced Encryption Standard) algorithm. > Add aes_encrypt and aes_decrypt UDFs. > Ref: > [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions] > [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33027) Add DisableUnnecessaryBucketedScan rule to AQE queryStagePreparationRules
[ https://issues.apache.org/jira/browse/SPARK-33027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33027: --- Assignee: Cheng Su > Add DisableUnnecessaryBucketedScan rule to AQE queryStagePreparationRules > - > > Key: SPARK-33027 > URL: https://issues.apache.org/jira/browse/SPARK-33027 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > > As a followup comment from > [https://github.com/apache/spark/pull/29804#issuecomment-700650620] , will > add the physical plan rule DisableUnnecessaryBucketedScan into AQE > AdaptiveSparkPlanExec.queryStagePreparationRules. This mostly needs some more > unit tests crafting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33027) Add DisableUnnecessaryBucketedScan rule to AQE queryStagePreparationRules
[ https://issues.apache.org/jira/browse/SPARK-33027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33027. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30200 [https://github.com/apache/spark/pull/30200] > Add DisableUnnecessaryBucketedScan rule to AQE queryStagePreparationRules > - > > Key: SPARK-33027 > URL: https://issues.apache.org/jira/browse/SPARK-33027 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.1.0 > > > As a followup comment from > [https://github.com/apache/spark/pull/29804#issuecomment-700650620] , will > add the physical plan rule DisableUnnecessaryBucketedScan into AQE > AdaptiveSparkPlanExec.queryStagePreparationRules. This mostly needs some more > unit tests crafting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency
[ https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224435#comment-17224435 ] Liya Fan commented on SPARK-33279: -- [~bryanc] Thanks a lot for your effort. > Spark 3.0 failure due to lack of Arrow dependency > - > > Key: SPARK-33279 > URL: https://issues.apache.org/jira/browse/SPARK-33279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > A recent change in Arrow has split the arrow-memory module into 3, so client > code must add a dependency of arrow-memory-netty (or arrow-memory-unsafe). > This has been done in the master branch of Spark, but not in the branch-3.0 > branch, this is causing the build in branch-3.0 to fail > (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency
[ https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224420#comment-17224420 ] Bryan Cutler edited comment on SPARK-33279 at 11/2/20, 5:21 AM: [~fan_li_ya] we should change the Arrow-Spark integration tests so that it doesn't try to build with the latest Arrow Java, and instead just test the latest pyarrow, which should work. I made ARROW-10457 for this. was (Author: bryanc): [~fan_li_ya] we should change the Arrow-Spark integration tests so that it doesn't try to build with the latest Arrow Java, and instead just test the latest pyarrow, which should work. > Spark 3.0 failure due to lack of Arrow dependency > - > > Key: SPARK-33279 > URL: https://issues.apache.org/jira/browse/SPARK-33279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > A recent change in Arrow has split the arrow-memory module into 3, so client > code must add a dependency of arrow-memory-netty (or arrow-memory-unsafe). > This has been done in the master branch of Spark, but not in the branch-3.0 > branch, this is causing the build in branch-3.0 to fail > (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency
[ https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224420#comment-17224420 ] Bryan Cutler commented on SPARK-33279: -- [~fan_li_ya] we should change the Arrow-Spark integration tests so that it doesn't try to build with the latest Arrow Java, and instead just test the latest pyarrow, which should work. > Spark 3.0 failure due to lack of Arrow dependency > - > > Key: SPARK-33279 > URL: https://issues.apache.org/jira/browse/SPARK-33279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > A recent change in Arrow has split the arrow-memory module into 3, so client > code must add a dependency of arrow-memory-netty (or arrow-memory-unsafe). > This has been done in the master branch of Spark, but not in the branch-3.0 > branch, this is causing the build in branch-3.0 to fail > (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33314) Avro reader drops rows
[ https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-33314: - Priority: Blocker (was: Major) > Avro reader drops rows > -- > > Key: SPARK-33314 > URL: https://issues.apache.org/jira/browse/SPARK-33314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Blocker > Labels: correctness > > Under certain circumstances, the V1 Avro reader drops rows. For example: > {noformat} > scala> val df = spark.range(0, 25).toDF("index") > df: org.apache.spark.sql.DataFrame = [index: bigint] > scala> df.write.mode("overwrite").format("avro").save("index_avro") > scala> val loaded = spark.read.format("avro").load("index_avro") > loaded: org.apache.spark.sql.DataFrame = [index: bigint] > scala> loaded.collect.size > res1: Int = 25 > scala> loaded.orderBy("index").collect.size > res2: Int = 17 <== expected 25 > scala> > loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet") > scala> spark.read.parquet("index_as_parquet").count > res4: Long = 17 > scala> > {noformat} > SPARK-32346 slightly refactored the AvroFileFormat and > AvroPartitionReaderFactory to use a new iterator-like trait called > AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and > stores the deserialized row for the next call to RowReader#nextRow. > Unfortunately, sometimes hasNextRow is called twice before nextRow is called, > resulting in a lost row (see > [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132], > which calls records.hasNext once before calling it again > [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]). > RowReader consumes the Avro record in hasNextRow, rather than nextRow, > because AvroDeserializer#deserialize potentially filters out the record. > Two possible fixes that I thought of: > 1) keep state in RowReader such that multiple calls to RowReader#hasNextRow > with no intervening call to RowReader#nextRow avoids consuming more than 1 > Avro record. This requires no changes to any code that extends RowReader, > just RowReader itself. > 2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow > could potentially return None) and wrap any iterator that extends RowReader > with a new iterator created by flatMap. This last iterator will filter out > the Nones and extract rows from the Somes. This requires changes to > AvroFileFormat and AvroPartitionReaderFactory as well as RowReader. > The first one seems simplest and most straightfoward, and doesn't require > changes to AvroFileFormat and AvroPartitionReaderFactory, only to > AvroUtils#RowReader. So I propose this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
[ https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33313: - Fix Version/s: 3.0.2 2.4.8 > R/run-tests.sh is not compatible with testthat >= 3.0 > - > > Key: SPARK-33313 > URL: https://issues.apache.org/jira/browse/SPARK-33313 > Project: Spark > Issue Type: Bug > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Currently we use {{testthat:::test_package_dir}} to run full SparkR tests > with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 > (https://github.com/r-lib/testthat/pull/1054). > Because of that AppVeyor tests fail with > {code:r} > Spark package found in SPARK_HOME: C:\projects\spark\bin\.. > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object 'test_package_dir' not found > Calls: ::: -> get > Execution halted > {code} > It seems like we can use {{testthat::test_dir}} which, since {{testthat}} > 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224379#comment-17224379 ] Arya Ketan commented on SPARK-24432: With acceptance of https://issues.apache.org/jira/browse/SPARK-30602, how is the design for this evolving? > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33273) Fix Flaky Test: ThriftServerQueryTestSuite. subquery_scalar_subquery_scalar_subquery_select_sql
[ https://issues.apache.org/jira/browse/SPARK-33273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224377#comment-17224377 ] Hyukjin Kwon commented on SPARK-33273: -- cc [~yumwang] do you have any clue? > Fix Flaky Test: ThriftServerQueryTestSuite. > subquery_scalar_subquery_scalar_subquery_select_sql > --- > > Key: SPARK-33273 > URL: https://issues.apache.org/jira/browse/SPARK-33273 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Blocker > Labels: correctness > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130369/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/ > {code} > [info] - subquery/scalar-subquery/scalar-subquery-select.sql *** FAILED *** > (3 seconds, 877 milliseconds) > [info] Expected "[1]0 2017-05-04 01:01:0...", but got "[]0 > 2017-05-04 01:01:0..." Result did not match for query #3 > [info] SELECT (SELECT min(t3d) FROM t3) min_t3d, > [info] (SELECT max(t2h) FROM t2) max_t2h > [info] FROM t1 > [info] WHERE t1a = 'val1c' (ThriftServerQueryTestSuite.scala:197) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33314) Avro reader drops rows
[ https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33314: Assignee: (was: Apache Spark) > Avro reader drops rows > -- > > Key: SPARK-33314 > URL: https://issues.apache.org/jira/browse/SPARK-33314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Under certain circumstances, the V1 Avro reader drops rows. For example: > {noformat} > scala> val df = spark.range(0, 25).toDF("index") > df: org.apache.spark.sql.DataFrame = [index: bigint] > scala> df.write.mode("overwrite").format("avro").save("index_avro") > scala> val loaded = spark.read.format("avro").load("index_avro") > loaded: org.apache.spark.sql.DataFrame = [index: bigint] > scala> loaded.collect.size > res1: Int = 25 > scala> loaded.orderBy("index").collect.size > res2: Int = 17 <== expected 25 > scala> > loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet") > scala> spark.read.parquet("index_as_parquet").count > res4: Long = 17 > scala> > {noformat} > SPARK-32346 slightly refactored the AvroFileFormat and > AvroPartitionReaderFactory to use a new iterator-like trait called > AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and > stores the deserialized row for the next call to RowReader#nextRow. > Unfortunately, sometimes hasNextRow is called twice before nextRow is called, > resulting in a lost row (see > [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132], > which calls records.hasNext once before calling it again > [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]). > RowReader consumes the Avro record in hasNextRow, rather than nextRow, > because AvroDeserializer#deserialize potentially filters out the record. > Two possible fixes that I thought of: > 1) keep state in RowReader such that multiple calls to RowReader#hasNextRow > with no intervening call to RowReader#nextRow avoids consuming more than 1 > Avro record. This requires no changes to any code that extends RowReader, > just RowReader itself. > 2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow > could potentially return None) and wrap any iterator that extends RowReader > with a new iterator created by flatMap. This last iterator will filter out > the Nones and extract rows from the Somes. This requires changes to > AvroFileFormat and AvroPartitionReaderFactory as well as RowReader. > The first one seems simplest and most straightfoward, and doesn't require > changes to AvroFileFormat and AvroPartitionReaderFactory, only to > AvroUtils#RowReader. So I propose this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33314) Avro reader drops rows
[ https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33314: Assignee: Apache Spark > Avro reader drops rows > -- > > Key: SPARK-33314 > URL: https://issues.apache.org/jira/browse/SPARK-33314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Major > Labels: correctness > > Under certain circumstances, the V1 Avro reader drops rows. For example: > {noformat} > scala> val df = spark.range(0, 25).toDF("index") > df: org.apache.spark.sql.DataFrame = [index: bigint] > scala> df.write.mode("overwrite").format("avro").save("index_avro") > scala> val loaded = spark.read.format("avro").load("index_avro") > loaded: org.apache.spark.sql.DataFrame = [index: bigint] > scala> loaded.collect.size > res1: Int = 25 > scala> loaded.orderBy("index").collect.size > res2: Int = 17 <== expected 25 > scala> > loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet") > scala> spark.read.parquet("index_as_parquet").count > res4: Long = 17 > scala> > {noformat} > SPARK-32346 slightly refactored the AvroFileFormat and > AvroPartitionReaderFactory to use a new iterator-like trait called > AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and > stores the deserialized row for the next call to RowReader#nextRow. > Unfortunately, sometimes hasNextRow is called twice before nextRow is called, > resulting in a lost row (see > [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132], > which calls records.hasNext once before calling it again > [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]). > RowReader consumes the Avro record in hasNextRow, rather than nextRow, > because AvroDeserializer#deserialize potentially filters out the record. > Two possible fixes that I thought of: > 1) keep state in RowReader such that multiple calls to RowReader#hasNextRow > with no intervening call to RowReader#nextRow avoids consuming more than 1 > Avro record. This requires no changes to any code that extends RowReader, > just RowReader itself. > 2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow > could potentially return None) and wrap any iterator that extends RowReader > with a new iterator created by flatMap. This last iterator will filter out > the Nones and extract rows from the Somes. This requires changes to > AvroFileFormat and AvroPartitionReaderFactory as well as RowReader. > The first one seems simplest and most straightfoward, and doesn't require > changes to AvroFileFormat and AvroPartitionReaderFactory, only to > AvroUtils#RowReader. So I propose this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33314) Avro reader drops rows
[ https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224367#comment-17224367 ] Apache Spark commented on SPARK-33314: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/30221 > Avro reader drops rows > -- > > Key: SPARK-33314 > URL: https://issues.apache.org/jira/browse/SPARK-33314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Under certain circumstances, the V1 Avro reader drops rows. For example: > {noformat} > scala> val df = spark.range(0, 25).toDF("index") > df: org.apache.spark.sql.DataFrame = [index: bigint] > scala> df.write.mode("overwrite").format("avro").save("index_avro") > scala> val loaded = spark.read.format("avro").load("index_avro") > loaded: org.apache.spark.sql.DataFrame = [index: bigint] > scala> loaded.collect.size > res1: Int = 25 > scala> loaded.orderBy("index").collect.size > res2: Int = 17 <== expected 25 > scala> > loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet") > scala> spark.read.parquet("index_as_parquet").count > res4: Long = 17 > scala> > {noformat} > SPARK-32346 slightly refactored the AvroFileFormat and > AvroPartitionReaderFactory to use a new iterator-like trait called > AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and > stores the deserialized row for the next call to RowReader#nextRow. > Unfortunately, sometimes hasNextRow is called twice before nextRow is called, > resulting in a lost row (see > [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132], > which calls records.hasNext once before calling it again > [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]). > RowReader consumes the Avro record in hasNextRow, rather than nextRow, > because AvroDeserializer#deserialize potentially filters out the record. > Two possible fixes that I thought of: > 1) keep state in RowReader such that multiple calls to RowReader#hasNextRow > with no intervening call to RowReader#nextRow avoids consuming more than 1 > Avro record. This requires no changes to any code that extends RowReader, > just RowReader itself. > 2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow > could potentially return None) and wrap any iterator that extends RowReader > with a new iterator created by flatMap. This last iterator will filter out > the Nones and extract rows from the Somes. This requires changes to > AvroFileFormat and AvroPartitionReaderFactory as well as RowReader. > The first one seems simplest and most straightfoward, and doesn't require > changes to AvroFileFormat and AvroPartitionReaderFactory, only to > AvroUtils#RowReader. So I propose this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33282) Replace Probot Autolabeler with Github Action
[ https://issues.apache.org/jira/browse/SPARK-33282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224366#comment-17224366 ] Hyukjin Kwon commented on SPARK-33282: -- Yeah, I noticed this but couldn't have time to take a look. Thanks for working on this. cc [~dongjoon] as well FYI. > Replace Probot Autolabeler with Github Action > - > > Key: SPARK-33282 > URL: https://issues.apache.org/jira/browse/SPARK-33282 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.0.1 >Reporter: Kyle Bendickson >Priority: Major > > The Probot Autolabeler that we were using in both the Iceberg and the Spark > repo is no longer working. I've confirmed that with the devleper, github user > [at]mithro, who has indicated that the Probot Autolabeler is end of life and > will not be maintained moving forward. > PRs have not been labeled for a few weeks now. > > As I'm already interfacing with ASF Infra to have the probot permissions > revoked from the Iceberg repo, and I've already submitted a patch to switch > Iceberg to the standard github labeler action, I figured I would go ahead and > volunteer myself to switch the Spark repo as well. > I will have a patch to switch to the new github labeler open within a few > days. > > Also thank you [~blue] (or [~holden]) for shepherding this! I didn't exactly > ask, but it was understood in our group meeting for Iceberg that I'd be > converting our labeler there so I figured I'd tackle the spark issue while > I'm getting my hands into the labeling configs anyway =) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33289) Error executing `SHOW VIEWS in {database}` using JDBC
[ https://issues.apache.org/jira/browse/SPARK-33289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33289. -- Resolution: Invalid > Error executing `SHOW VIEWS in {database}` using JDBC > - > > Key: SPARK-33289 > URL: https://issues.apache.org/jira/browse/SPARK-33289 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Prakshal Jain >Priority: Major > > Full description: > [https://stackoverflow.com/questions/60742577/spark-sql-query-show-views-in-through-hive-metastore-fails-with-missing-func] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33289) Error executing `SHOW VIEWS in {database}` using JDBC
[ https://issues.apache.org/jira/browse/SPARK-33289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224365#comment-17224365 ] Hyukjin Kwon commented on SPARK-33289: -- This seems an issue in Tableau that creates a query wrong. Also, please provide a reproducer together next time. > Error executing `SHOW VIEWS in {database}` using JDBC > - > > Key: SPARK-33289 > URL: https://issues.apache.org/jira/browse/SPARK-33289 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Prakshal Jain >Priority: Major > > Full description: > [https://stackoverflow.com/questions/60742577/spark-sql-query-show-views-in-through-hive-metastore-fails-with-missing-func] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33312) Provide latest Spark 2.4.7 runnable distribution
[ https://issues.apache.org/jira/browse/SPARK-33312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224364#comment-17224364 ] Hyukjin Kwon commented on SPARK-33312: -- Providing that option is pretty much an overhead to update the site - the site is static site which needs a manual update. Can you use snapshots instead? https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/ > Provide latest Spark 2.4.7 runnable distribution > > > Key: SPARK-33312 > URL: https://issues.apache.org/jira/browse/SPARK-33312 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.4.7 >Reporter: Prateek Dubey >Priority: Major > > Not sure if this is the right approach, however it would be great if latest > Spark 2.4.7 runnable distribution can be provided here - > [https://spark.apache.org/downloads.html] > Currently it seems the last build was done on Sept 12th' 2020. > I'm working on running Spark workloads on EKS using EKS IRSA. I'm able to run > Spark workloads on EKS using IRSA with Spark 3.0/ Hadoop 3.2, however I want > to do the same with Spark 2.4.7/ Hadoop 2.7. > Recently this PR was merged with 2.4.x - > [https://github.com/apache/spark/pull/29877] and therefore I'm in need of > latest Spark distribution > > PS: I tried building latest Spark 2.4.7 myself as well using Maven, however > there are too many errors every-time when it reaches R, therefore it would be > great if Spark community itself can provide the latest build. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33309) Replace origin GROUPING SETS with new expression
[ https://issues.apache.org/jira/browse/SPARK-33309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33309: - Description: Replace current GroupingSets just using expression {code} case class GroupingSets( selectedGroupByExprs: Seq[Seq[Expression]], groupByExprs: Seq[Expression], child: LogicalPlan, aggregations: Seq[NamedExpression]) extends UnaryNode { override def output: Seq[Attribute] = aggregations.map(_.toAttribute) // Needs to be unresolved before its translated to Aggregate + Expand because output attributes // will change in analysis. override lazy val resolved: Boolean = false } {code} was: Replace current GroupingSets just using expression ``` case class GroupingSets( selectedGroupByExprs: Seq[Seq[Expression]], groupByExprs: Seq[Expression], child: LogicalPlan, aggregations: Seq[NamedExpression]) extends UnaryNode { override def output: Seq[Attribute] = aggregations.map(_.toAttribute) // Needs to be unresolved before its translated to Aggregate + Expand because output attributes // will change in analysis. override lazy val resolved: Boolean = false } ``` > Replace origin GROUPING SETS with new expression > > > Key: SPARK-33309 > URL: https://issues.apache.org/jira/browse/SPARK-33309 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > Replace current GroupingSets just using expression > {code} > case class GroupingSets( > selectedGroupByExprs: Seq[Seq[Expression]], > groupByExprs: Seq[Expression], > child: LogicalPlan, > aggregations: Seq[NamedExpression]) extends UnaryNode { > override def output: Seq[Attribute] = aggregations.map(_.toAttribute) > // Needs to be unresolved before its translated to Aggregate + Expand > because output attributes > // will change in analysis. > override lazy val resolved: Boolean = false > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33312) Provide latest Spark 2.4.7 runnable distribution
[ https://issues.apache.org/jira/browse/SPARK-33312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224363#comment-17224363 ] Hyukjin Kwon commented on SPARK-33312: -- Providing that option is pretty much an overhead to update the site - the site is static site which needs a manual update. Can you use snapshots instead? https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/ > Provide latest Spark 2.4.7 runnable distribution > > > Key: SPARK-33312 > URL: https://issues.apache.org/jira/browse/SPARK-33312 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.4.7 >Reporter: Prateek Dubey >Priority: Major > > Not sure if this is the right approach, however it would be great if latest > Spark 2.4.7 runnable distribution can be provided here - > [https://spark.apache.org/downloads.html] > Currently it seems the last build was done on Sept 12th' 2020. > I'm working on running Spark workloads on EKS using EKS IRSA. I'm able to run > Spark workloads on EKS using IRSA with Spark 3.0/ Hadoop 3.2, however I want > to do the same with Spark 2.4.7/ Hadoop 2.7. > Recently this PR was merged with 2.4.x - > [https://github.com/apache/spark/pull/29877] and therefore I'm in need of > latest Spark distribution > > PS: I tried building latest Spark 2.4.7 myself as well using Maven, however > there are too many errors every-time when it reaches R, therefore it would be > great if Spark community itself can provide the latest build. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33314) Avro reader drops rows
[ https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-33314: -- Labels: correctness (was: ) > Avro reader drops rows > -- > > Key: SPARK-33314 > URL: https://issues.apache.org/jira/browse/SPARK-33314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Under certain circumstances, the V1 Avro reader drops rows. For example: > {noformat} > scala> val df = spark.range(0, 25).toDF("index") > df: org.apache.spark.sql.DataFrame = [index: bigint] > scala> df.write.mode("overwrite").format("avro").save("index_avro") > scala> val loaded = spark.read.format("avro").load("index_avro") > loaded: org.apache.spark.sql.DataFrame = [index: bigint] > scala> loaded.collect.size > res1: Int = 25 > scala> loaded.orderBy("index").collect.size > res2: Int = 17 <== expected 25 > scala> > loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet") > scala> spark.read.parquet("index_as_parquet").count > res4: Long = 17 > scala> > {noformat} > SPARK-32346 slightly refactored the AvroFileFormat and > AvroPartitionReaderFactory to use a new iterator-like trait called > AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and > stores the deserialized row for the next call to RowReader#nextRow. > Unfortunately, sometimes hasNextRow is called twice before nextRow is called, > resulting in a lost row (see > [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132], > which calls records.hasNext once before calling it again > [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]). > RowReader consumes the Avro record in hasNextRow, rather than nextRow, > because AvroDeserializer#deserialize potentially filters out the record. > Two possible fixes that I thought of: > 1) keep state in RowReader such that multiple calls to RowReader#hasNextRow > with no intervening call to RowReader#nextRow avoids consuming more than 1 > Avro record. This requires no changes to any code that extends RowReader, > just RowReader itself. > 2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow > could potentially return None) and wrap any iterator that extends RowReader > with a new iterator created by flatMap. This last iterator will filter out > the Nones and extract rows from the Somes. This requires changes to > AvroFileFormat and AvroPartitionReaderFactory as well as RowReader. > The first one seems simplest and most straightfoward, and doesn't require > changes to AvroFileFormat and AvroPartitionReaderFactory, only to > AvroUtils#RowReader. So I propose this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33314) Avro reader drops rows
Bruce Robbins created SPARK-33314: - Summary: Avro reader drops rows Key: SPARK-33314 URL: https://issues.apache.org/jira/browse/SPARK-33314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Bruce Robbins Under certain circumstances, the V1 Avro reader drops rows. For example: {noformat} scala> val df = spark.range(0, 25).toDF("index") df: org.apache.spark.sql.DataFrame = [index: bigint] scala> df.write.mode("overwrite").format("avro").save("index_avro") scala> val loaded = spark.read.format("avro").load("index_avro") loaded: org.apache.spark.sql.DataFrame = [index: bigint] scala> loaded.collect.size res1: Int = 25 scala> loaded.orderBy("index").collect.size res2: Int = 17 <== expected 25 scala> loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet") scala> spark.read.parquet("index_as_parquet").count res4: Long = 17 scala> {noformat} SPARK-32346 slightly refactored the AvroFileFormat and AvroPartitionReaderFactory to use a new iterator-like trait called AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and stores the deserialized row for the next call to RowReader#nextRow. Unfortunately, sometimes hasNextRow is called twice before nextRow is called, resulting in a lost row (see [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132], which calls records.hasNext once before calling it again [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]). RowReader consumes the Avro record in hasNextRow, rather than nextRow, because AvroDeserializer#deserialize potentially filters out the record. Two possible fixes that I thought of: 1) keep state in RowReader such that multiple calls to RowReader#hasNextRow with no intervening call to RowReader#nextRow avoids consuming more than 1 Avro record. This requires no changes to any code that extends RowReader, just RowReader itself. 2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow could potentially return None) and wrap any iterator that extends RowReader with a new iterator created by flatMap. This last iterator will filter out the Nones and extract rows from the Somes. This requires changes to AvroFileFormat and AvroPartitionReaderFactory as well as RowReader. The first one seems simplest and most straightfoward, and doesn't require changes to AvroFileFormat and AvroPartitionReaderFactory, only to AvroUtils#RowReader. So I propose this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
[ https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33277: - Fix Version/s: 3.0.2 2.4.8 > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > > > Key: SPARK-33277 > URL: https://issues.apache.org/jira/browse/SPARK-33277 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > E.g.,: > {code:java} > spark.range(0, 10, 1, 1).write.parquet(path) > spark.conf.set("spark.sql.columnVector.offheap.enabled", True) > def f(x): > return 0 > fUdf = udf(f, LongType()) > spark.read.parquet(path).select(fUdf('id')).head() > {code} > This is because, the Python evaluation consumes the parent iterator in a > separate thread and it consumes more data from the parent even after the task > ends and the parent is closed. If an off-heap column vector exists in the > parent iterator, it could cause segmentation fault which crashes the executor. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
[ https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33277: Assignee: Takuya Ueshin > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > > > Key: SPARK-33277 > URL: https://issues.apache.org/jira/browse/SPARK-33277 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.1.0 > > > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > E.g.,: > {code:java} > spark.range(0, 10, 1, 1).write.parquet(path) > spark.conf.set("spark.sql.columnVector.offheap.enabled", True) > def f(x): > return 0 > fUdf = udf(f, LongType()) > spark.read.parquet(path).select(fUdf('id')).head() > {code} > This is because, the Python evaluation consumes the parent iterator in a > separate thread and it consumes more data from the parent even after the task > ends and the parent is closed. If an off-heap column vector exists in the > parent iterator, it could cause segmentation fault which crashes the executor. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
[ https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224362#comment-17224362 ] Apache Spark commented on SPARK-33313: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30220 > R/run-tests.sh is not compatible with testthat >= 3.0 > - > > Key: SPARK-33313 > URL: https://issues.apache.org/jira/browse/SPARK-33313 > Project: Spark > Issue Type: Bug > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Currently we use {{testthat:::test_package_dir}} to run full SparkR tests > with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 > (https://github.com/r-lib/testthat/pull/1054). > Because of that AppVeyor tests fail with > {code:r} > Spark package found in SPARK_HOME: C:\projects\spark\bin\.. > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object 'test_package_dir' not found > Calls: ::: -> get > Execution halted > {code} > It seems like we can use {{testthat::test_dir}} which, since {{testthat}} > 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
[ https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224361#comment-17224361 ] Apache Spark commented on SPARK-33313: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30220 > R/run-tests.sh is not compatible with testthat >= 3.0 > - > > Key: SPARK-33313 > URL: https://issues.apache.org/jira/browse/SPARK-33313 > Project: Spark > Issue Type: Bug > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Currently we use {{testthat:::test_package_dir}} to run full SparkR tests > with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 > (https://github.com/r-lib/testthat/pull/1054). > Because of that AppVeyor tests fail with > {code:r} > Spark package found in SPARK_HOME: C:\projects\spark\bin\.. > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object 'test_package_dir' not found > Calls: ::: -> get > Execution halted > {code} > It seems like we can use {{testthat::test_dir}} which, since {{testthat}} > 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
[ https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33313. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30219 [https://github.com/apache/spark/pull/30219] > R/run-tests.sh is not compatible with testthat >= 3.0 > - > > Key: SPARK-33313 > URL: https://issues.apache.org/jira/browse/SPARK-33313 > Project: Spark > Issue Type: Bug > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Currently we use {{testthat:::test_package_dir}} to run full SparkR tests > with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 > (https://github.com/r-lib/testthat/pull/1054). > Because of that AppVeyor tests fail with > {code:r} > Spark package found in SPARK_HOME: C:\projects\spark\bin\.. > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object 'test_package_dir' not found > Calls: ::: -> get > Execution halted > {code} > It seems like we can use {{testthat::test_dir}} which, since {{testthat}} > 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
[ https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33313: Assignee: Maciej Szymkiewicz > R/run-tests.sh is not compatible with testthat >= 3.0 > - > > Key: SPARK-33313 > URL: https://issues.apache.org/jira/browse/SPARK-33313 > Project: Spark > Issue Type: Bug > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > Currently we use {{testthat:::test_package_dir}} to run full SparkR tests > with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 > (https://github.com/r-lib/testthat/pull/1054). > Because of that AppVeyor tests fail with > {code:r} > Spark package found in SPARK_HOME: C:\projects\spark\bin\.. > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object 'test_package_dir' not found > Calls: ::: -> get > Execution halted > {code} > It seems like we can use {{testthat::test_dir}} which, since {{testthat}} > 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x
[ https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30663: Assignee: Maciej Szymkiewicz > Remove 1.x testthat switch once Jenkins version is updated to 2.x > - > > Key: SPARK-30663 > URL: https://issues.apache.org/jira/browse/SPARK-30663 > Project: Spark > Issue Type: Planned Work > Components: SparkR, Tests >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > > As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility > mode > {code} > if (grepl("^1\\..*", packageVersion("testthat"))) { > NULL, # testthat 1.x > "summary") test_runner <- > testthat:::run_tests > reporter <- "summary" > } else { > # testthat >= 2.0.0 > test_runner <- testthat:::test_package_dir > reporter <- testthat::default_reporter() > } > {code} > in {{R/pkg/tests/run-all.R}}. > It should be removed once whole infrastructure uses {{testhat}} 2.x or later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x
[ https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30663. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30219 [https://github.com/apache/spark/pull/30219] > Remove 1.x testthat switch once Jenkins version is updated to 2.x > - > > Key: SPARK-30663 > URL: https://issues.apache.org/jira/browse/SPARK-30663 > Project: Spark > Issue Type: Planned Work > Components: SparkR, Tests >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.1.0 > > > As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility > mode > {code} > if (grepl("^1\\..*", packageVersion("testthat"))) { > NULL, # testthat 1.x > "summary") test_runner <- > testthat:::run_tests > reporter <- "summary" > } else { > # testthat >= 2.0.0 > test_runner <- testthat:::test_package_dir > reporter <- testthat::default_reporter() > } > {code} > in {{R/pkg/tests/run-all.R}}. > It should be removed once whole infrastructure uses {{testhat}} 2.x or later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
[ https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224349#comment-17224349 ] Apache Spark commented on SPARK-33313: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/30219 > R/run-tests.sh is not compatible with testthat >= 3.0 > - > > Key: SPARK-33313 > URL: https://issues.apache.org/jira/browse/SPARK-33313 > Project: Spark > Issue Type: Bug > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently we use {{testthat:::test_package_dir}} to run full SparkR tests > with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 > (https://github.com/r-lib/testthat/pull/1054). > Because of that AppVeyor tests fail with > {code:r} > Spark package found in SPARK_HOME: C:\projects\spark\bin\.. > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object 'test_package_dir' not found > Calls: ::: -> get > Execution halted > {code} > It seems like we can use {{testthat::test_dir}} which, since {{testthat}} > 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
[ https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33313: Assignee: Apache Spark > R/run-tests.sh is not compatible with testthat >= 3.0 > - > > Key: SPARK-33313 > URL: https://issues.apache.org/jira/browse/SPARK-33313 > Project: Spark > Issue Type: Bug > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > Currently we use {{testthat:::test_package_dir}} to run full SparkR tests > with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 > (https://github.com/r-lib/testthat/pull/1054). > Because of that AppVeyor tests fail with > {code:r} > Spark package found in SPARK_HOME: C:\projects\spark\bin\.. > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object 'test_package_dir' not found > Calls: ::: -> get > Execution halted > {code} > It seems like we can use {{testthat::test_dir}} which, since {{testthat}} > 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x
[ https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-30663: Assignee: (was: Apache Spark) > Remove 1.x testthat switch once Jenkins version is updated to 2.x > - > > Key: SPARK-30663 > URL: https://issues.apache.org/jira/browse/SPARK-30663 > Project: Spark > Issue Type: Planned Work > Components: SparkR, Tests >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility > mode > {code} > if (grepl("^1\\..*", packageVersion("testthat"))) { > NULL, # testthat 1.x > "summary") test_runner <- > testthat:::run_tests > reporter <- "summary" > } else { > # testthat >= 2.0.0 > test_runner <- testthat:::test_package_dir > reporter <- testthat::default_reporter() > } > {code} > in {{R/pkg/tests/run-all.R}}. > It should be removed once whole infrastructure uses {{testhat}} 2.x or later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
[ https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224348#comment-17224348 ] Apache Spark commented on SPARK-33313: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/30219 > R/run-tests.sh is not compatible with testthat >= 3.0 > - > > Key: SPARK-33313 > URL: https://issues.apache.org/jira/browse/SPARK-33313 > Project: Spark > Issue Type: Bug > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently we use {{testthat:::test_package_dir}} to run full SparkR tests > with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 > (https://github.com/r-lib/testthat/pull/1054). > Because of that AppVeyor tests fail with > {code:r} > Spark package found in SPARK_HOME: C:\projects\spark\bin\.. > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object 'test_package_dir' not found > Calls: ::: -> get > Execution halted > {code} > It seems like we can use {{testthat::test_dir}} which, since {{testthat}} > 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
[ https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33313: Assignee: (was: Apache Spark) > R/run-tests.sh is not compatible with testthat >= 3.0 > - > > Key: SPARK-33313 > URL: https://issues.apache.org/jira/browse/SPARK-33313 > Project: Spark > Issue Type: Bug > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > Currently we use {{testthat:::test_package_dir}} to run full SparkR tests > with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 > (https://github.com/r-lib/testthat/pull/1054). > Because of that AppVeyor tests fail with > {code:r} > Spark package found in SPARK_HOME: C:\projects\spark\bin\.. > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object 'test_package_dir' not found > Calls: ::: -> get > Execution halted > {code} > It seems like we can use {{testthat::test_dir}} which, since {{testthat}} > 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x
[ https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224347#comment-17224347 ] Apache Spark commented on SPARK-30663: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/30219 > Remove 1.x testthat switch once Jenkins version is updated to 2.x > - > > Key: SPARK-30663 > URL: https://issues.apache.org/jira/browse/SPARK-30663 > Project: Spark > Issue Type: Planned Work > Components: SparkR, Tests >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility > mode > {code} > if (grepl("^1\\..*", packageVersion("testthat"))) { > NULL, # testthat 1.x > "summary") test_runner <- > testthat:::run_tests > reporter <- "summary" > } else { > # testthat >= 2.0.0 > test_runner <- testthat:::test_package_dir > reporter <- testthat::default_reporter() > } > {code} > in {{R/pkg/tests/run-all.R}}. > It should be removed once whole infrastructure uses {{testhat}} 2.x or later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x
[ https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-30663: Assignee: Apache Spark > Remove 1.x testthat switch once Jenkins version is updated to 2.x > - > > Key: SPARK-30663 > URL: https://issues.apache.org/jira/browse/SPARK-30663 > Project: Spark > Issue Type: Planned Work > Components: SparkR, Tests >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Minor > > As part of SPARK-23435 proposal we include {{testthat}} 1.x compatibility > mode > {code} > if (grepl("^1\\..*", packageVersion("testthat"))) { > NULL, # testthat 1.x > "summary") test_runner <- > testthat:::run_tests > reporter <- "summary" > } else { > # testthat >= 2.0.0 > test_runner <- testthat:::test_package_dir > reporter <- testthat::default_reporter() > } > {code} > in {{R/pkg/tests/run-all.R}}. > It should be removed once whole infrastructure uses {{testhat}} 2.x or later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0
Maciej Szymkiewicz created SPARK-33313: -- Summary: R/run-tests.sh is not compatible with testthat >= 3.0 Key: SPARK-33313 URL: https://issues.apache.org/jira/browse/SPARK-33313 Project: Spark Issue Type: Bug Components: R, Tests Affects Versions: 3.0.0, 3.1.0 Reporter: Maciej Szymkiewicz Currently we use {{testthat:::test_package_dir}} to run full SparkR tests with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 (https://github.com/r-lib/testthat/pull/1054). Because of that AppVeyor tests fail with {code:r} Spark package found in SPARK_HOME: C:\projects\spark\bin\.. Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : object 'test_package_dir' not found Calls: ::: -> get Execution halted {code} It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 3.0, supports {{package}} parameter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
[ https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224339#comment-17224339 ] Apache Spark commented on SPARK-33277: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/30218 > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > > > Key: SPARK-33277 > URL: https://issues.apache.org/jira/browse/SPARK-33277 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Takuya Ueshin >Priority: Major > Fix For: 3.1.0 > > > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > E.g.,: > {code:java} > spark.range(0, 10, 1, 1).write.parquet(path) > spark.conf.set("spark.sql.columnVector.offheap.enabled", True) > def f(x): > return 0 > fUdf = udf(f, LongType()) > spark.read.parquet(path).select(fUdf('id')).head() > {code} > This is because, the Python evaluation consumes the parent iterator in a > separate thread and it consumes more data from the parent even after the task > ends and the parent is closed. If an off-heap column vector exists in the > parent iterator, it could cause segmentation fault which crashes the executor. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
[ https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224338#comment-17224338 ] Apache Spark commented on SPARK-33277: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/30218 > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > > > Key: SPARK-33277 > URL: https://issues.apache.org/jira/browse/SPARK-33277 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Takuya Ueshin >Priority: Major > Fix For: 3.1.0 > > > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > E.g.,: > {code:java} > spark.range(0, 10, 1, 1).write.parquet(path) > spark.conf.set("spark.sql.columnVector.offheap.enabled", True) > def f(x): > return 0 > fUdf = udf(f, LongType()) > spark.read.parquet(path).select(fUdf('id')).head() > {code} > This is because, the Python evaluation consumes the parent iterator in a > separate thread and it consumes more data from the parent even after the task > ends and the parent is closed. If an off-heap column vector exists in the > parent iterator, it could cause segmentation fault which crashes the executor. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
[ https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224329#comment-17224329 ] Apache Spark commented on SPARK-33277: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/30217 > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > > > Key: SPARK-33277 > URL: https://issues.apache.org/jira/browse/SPARK-33277 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Takuya Ueshin >Priority: Major > Fix For: 3.1.0 > > > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > E.g.,: > {code:java} > spark.range(0, 10, 1, 1).write.parquet(path) > spark.conf.set("spark.sql.columnVector.offheap.enabled", True) > def f(x): > return 0 > fUdf = udf(f, LongType()) > spark.read.parquet(path).select(fUdf('id')).head() > {code} > This is because, the Python evaluation consumes the parent iterator in a > separate thread and it consumes more data from the parent even after the task > ends and the parent is closed. If an off-heap column vector exists in the > parent iterator, it could cause segmentation fault which crashes the executor. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
[ https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224328#comment-17224328 ] Apache Spark commented on SPARK-33277: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/30217 > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > > > Key: SPARK-33277 > URL: https://issues.apache.org/jira/browse/SPARK-33277 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Takuya Ueshin >Priority: Major > Fix For: 3.1.0 > > > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > E.g.,: > {code:java} > spark.range(0, 10, 1, 1).write.parquet(path) > spark.conf.set("spark.sql.columnVector.offheap.enabled", True) > def f(x): > return 0 > fUdf = udf(f, LongType()) > spark.read.parquet(path).select(fUdf('id')).head() > {code} > This is because, the Python evaluation consumes the parent iterator in a > separate thread and it consumes more data from the parent even after the task > ends and the parent is closed. If an off-heap column vector exists in the > parent iterator, it could cause segmentation fault which crashes the executor. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33304) Add from_avro and to_avro functions to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224301#comment-17224301 ] Apache Spark commented on SPARK-33304: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/30216 > Add from_avro and to_avro functions to SparkR > - > > Key: SPARK-33304 > URL: https://issues.apache.org/jira/browse/SPARK-33304 > Project: Spark > Issue Type: Improvement > Components: R, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > {{from_avro}} and {{to_avro}} have been added to Scala / Java / Python in > 3.0, but are still missing in R API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33304) Add from_avro and to_avro functions to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33304: Assignee: Apache Spark > Add from_avro and to_avro functions to SparkR > - > > Key: SPARK-33304 > URL: https://issues.apache.org/jira/browse/SPARK-33304 > Project: Spark > Issue Type: Improvement > Components: R, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Major > > {{from_avro}} and {{to_avro}} have been added to Scala / Java / Python in > 3.0, but are still missing in R API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33304) Add from_avro and to_avro functions to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224299#comment-17224299 ] Apache Spark commented on SPARK-33304: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/30216 > Add from_avro and to_avro functions to SparkR > - > > Key: SPARK-33304 > URL: https://issues.apache.org/jira/browse/SPARK-33304 > Project: Spark > Issue Type: Improvement > Components: R, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > {{from_avro}} and {{to_avro}} have been added to Scala / Java / Python in > 3.0, but are still missing in R API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33304) Add from_avro and to_avro functions to SparkR
[ https://issues.apache.org/jira/browse/SPARK-33304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33304: Assignee: (was: Apache Spark) > Add from_avro and to_avro functions to SparkR > - > > Key: SPARK-33304 > URL: https://issues.apache.org/jira/browse/SPARK-33304 > Project: Spark > Issue Type: Improvement > Components: R, SQL >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > {{from_avro}} and {{to_avro}} have been added to Scala / Java / Python in > 3.0, but are still missing in R API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix
[ https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-20044. Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29820 [https://github.com/apache/spark/pull/29820] > Support Spark UI behind front-end reverse proxy using a path prefix > --- > > Key: SPARK-20044 > URL: https://issues.apache.org/jira/browse/SPARK-20044 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0 >Reporter: Oliver Koeth >Assignee: Gengliang Wang >Priority: Minor > Labels: reverse-proxy, sso > Fix For: 3.1.0 > > > Purpose: allow to run the Spark web UI behind a reverse proxy with URLs > prefixed by a context root, like www.mydomain.com/spark. In particular, this > allows to access multiple Spark clusters through the same virtual host, only > distinguishing them by context root, like www.mydomain.com/cluster1, > www.mydomain.com/cluster2, and it allows to run the Spark UI in a common > cookie domain (for SSO) with other services. > [SPARK-15487] introduced some support for front-end reverse proxies by > allowing all Spark UI requests to be routed through the master UI as a single > endpoint and also added a spark.ui.reverseProxyUrl setting to define a > another proxy sitting in front of Spark. However, as noted in the comments on > [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl > includes a context root like the examples above: Most links generated by the > Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not > account for a path prefix (context root) and work only if the Spark UI "owns" > the entire virtual host. In fact, the only place in the UI where the > reverseProxyUrl seems to be used is the back-link from the worker UI to the > master UI. > The discussion on [SPARK-15487] proposes to open a new issue for the problem, > but that does not seem to have happened, so this issue aims to address the > remaining shortcomings of spark.ui.reverseProxyUrl > The problem can be partially worked around by doing content rewrite in a > front-end proxy and prefixing src="/..." or href="/..." links with a context > root. However, detecting and patching URLs in HTML output is not a robust > approach and breaks down for URLs included in custom REST responses. E.g. the > "allexecutors" REST call used from the Spark 2.1.0 application/executors page > returns links for log viewing that direct to the worker UI and do not work in > this scenario. > This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL > generation. Experiments indicate that most of this can simply be achieved by > using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase > system property. Beyond that, the places that require adaption are > - worker and application links in the master web UI > - webui URLs returned by REST interfaces > Note: It seems that returned redirect location headers do not need to be > adapted, since URL rewriting for these is commonly done in front-end proxies > and has a well-defined interface -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
[ https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33277. -- Fix Version/s: 3.1.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/30177 > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > > > Key: SPARK-33277 > URL: https://issues.apache.org/jira/browse/SPARK-33277 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Takuya Ueshin >Priority: Major > Fix For: 3.1.0 > > > Python/Pandas UDF right after off-heap vectorized reader could cause executor > crash. > E.g.,: > {code:java} > spark.range(0, 10, 1, 1).write.parquet(path) > spark.conf.set("spark.sql.columnVector.offheap.enabled", True) > def f(x): > return 0 > fUdf = udf(f, LongType()) > spark.read.parquet(path).select(fUdf('id')).head() > {code} > This is because, the Python evaluation consumes the parent iterator in a > separate thread and it consumes more data from the parent even after the task > ends and the parent is closed. If an off-heap column vector exists in the > parent iterator, it could cause segmentation fault which crashes the executor. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33310) Relax pyspark typing for sql str functions
[ https://issues.apache.org/jira/browse/SPARK-33310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33310: Assignee: Daniel Himmelstein > Relax pyspark typing for sql str functions > -- > > Key: SPARK-33310 > URL: https://issues.apache.org/jira/browse/SPARK-33310 > Project: Spark > Issue Type: Wish > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Daniel Himmelstein >Assignee: Daniel Himmelstein >Priority: Minor > Labels: pyspark.sql.functions, type > Fix For: 3.1.0 > > > Several pyspark.sql.functions have overly strict typing, in that the type is > more restrictive than the functionality. Specifically, the function allows > specifying the column to operate on with a pyspark.sql.Column or a str. This > is handled internally by > [_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50], > which accepts a Column or string. > There is a pre-existing type for this: > [ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37]. > ColumnOrName is used for many of the type definitions of > pyspark.sql.functions arguments, but [not > for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162] > locate, lpad, rpad, repeat, and split. > {code:java} > def locate(substr: str, str: Column, pos: int = ...) -> Column: ... > def lpad(col: Column, len: int, pad: str) -> Column: ... > def rpad(col: Column, len: int, pad: str) -> Column: ... > def repeat(col: Column, n: int) -> Column: ... > def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code} > ColumnOrName was not added by [~zero323] since Maciej "was concerned that > this might be confusing or ambiguous", because these functions take a column > to operate on as well strings which are used in the operation. > But I think ColumnOrName makes clear that this variable refers to the column > and not a string parameter. Also there are other ways to address confusion, > such as via the docstring or by changing the argument name for the column to > col from str. > Finally, there's considerable convenience for users to not have to wrap > column names in pyspark.sql.functions.col. Elsewhere the API seems pretty > consistent in its willingness to accept columns by name and not Column object > (at least when there is not alternative meaning for a string value, exception > would be .when/.otherwise). > For example, we were calling pyspark.sql.functions.split with a string value > for the str argument (specifying which column to split). And I noticed this > when we enforced typing with pyspark-stubs in preparation for pyspark 3.1. > For users that will enable typing in 3.1, this is a restriction in > functionality. > Pre-existing PRs to address this: > * [https://github.com/apache/spark/pull/30209] > * [https://github.com/zero323/pyspark-stubs/pull/420] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33310) Relax pyspark typing for sql str functions
[ https://issues.apache.org/jira/browse/SPARK-33310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33310. -- Resolution: Fixed Issue resolved by pull request 30209 [https://github.com/apache/spark/pull/30209] > Relax pyspark typing for sql str functions > -- > > Key: SPARK-33310 > URL: https://issues.apache.org/jira/browse/SPARK-33310 > Project: Spark > Issue Type: Wish > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Daniel Himmelstein >Assignee: Daniel Himmelstein >Priority: Minor > Labels: pyspark.sql.functions, type > Fix For: 3.1.0 > > > Several pyspark.sql.functions have overly strict typing, in that the type is > more restrictive than the functionality. Specifically, the function allows > specifying the column to operate on with a pyspark.sql.Column or a str. This > is handled internally by > [_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50], > which accepts a Column or string. > There is a pre-existing type for this: > [ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37]. > ColumnOrName is used for many of the type definitions of > pyspark.sql.functions arguments, but [not > for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162] > locate, lpad, rpad, repeat, and split. > {code:java} > def locate(substr: str, str: Column, pos: int = ...) -> Column: ... > def lpad(col: Column, len: int, pad: str) -> Column: ... > def rpad(col: Column, len: int, pad: str) -> Column: ... > def repeat(col: Column, n: int) -> Column: ... > def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code} > ColumnOrName was not added by [~zero323] since Maciej "was concerned that > this might be confusing or ambiguous", because these functions take a column > to operate on as well strings which are used in the operation. > But I think ColumnOrName makes clear that this variable refers to the column > and not a string parameter. Also there are other ways to address confusion, > such as via the docstring or by changing the argument name for the column to > col from str. > Finally, there's considerable convenience for users to not have to wrap > column names in pyspark.sql.functions.col. Elsewhere the API seems pretty > consistent in its willingness to accept columns by name and not Column object > (at least when there is not alternative meaning for a string value, exception > would be .when/.otherwise). > For example, we were calling pyspark.sql.functions.split with a string value > for the str argument (specifying which column to split). And I noticed this > when we enforced typing with pyspark-stubs in preparation for pyspark 3.1. > For users that will enable typing in 3.1, this is a restriction in > functionality. > Pre-existing PRs to address this: > * [https://github.com/apache/spark/pull/30209] > * [https://github.com/zero323/pyspark-stubs/pull/420] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33312) Provide latest Spark 2.4.7 runnable distribution
Prateek Dubey created SPARK-33312: - Summary: Provide latest Spark 2.4.7 runnable distribution Key: SPARK-33312 URL: https://issues.apache.org/jira/browse/SPARK-33312 Project: Spark Issue Type: Task Components: Build Affects Versions: 2.4.7 Reporter: Prateek Dubey Not sure if this is the right approach, however it would be great if latest Spark 2.4.7 runnable distribution can be provided here - [https://spark.apache.org/downloads.html] Currently it seems the last build was done on Sept 12th' 2020. I'm working on running Spark workloads on EKS using EKS IRSA. I'm able to run Spark workloads on EKS using IRSA with Spark 3.0/ Hadoop 3.2, however I want to do the same with Spark 2.4.7/ Hadoop 2.7. Recently this PR was merged with 2.4.x - [https://github.com/apache/spark/pull/29877] and therefore I'm in need of latest Spark distribution PS: I tried building latest Spark 2.4.7 myself as well using Maven, however there are too many errors every-time when it reaches R, therefore it would be great if Spark community itself can provide the latest build. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org