[jira] [Commented] (SPARK-27196) Beginning offset 115204574 is after the ending offset 115204516 for topic

2020-11-01 Thread Yu Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224470#comment-17224470
 ] 

Yu Wang commented on SPARK-27196:
-

https://issues.apache.org/jira/browse/KAFKA-8124

> Beginning offset 115204574 is after the ending offset 115204516 for topic 
> --
>
> Key: SPARK-27196
> URL: https://issues.apache.org/jira/browse/SPARK-27196
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
> Environment: Spark : 2.3.0
> Spark Kafka: spark-streaming-kafka-0-10_2.3.0
> Kafka Client: org.apache.kafka.kafka-clients: 0.11.0.1
>Reporter: Prasanna Talakanti
>Priority: Major
>
> We are getting this issue in production and the Spark consumer is dying because of 
> an offset issue.
> We observed the following error in Kafka Broker
> --
> [2019-03-18 14:40:14,100] WARN Unable to reconnect to ZooKeeper service, 
> session 0x1692e9ff4410004 has expired (org.apache.zookeeper.ClientCnxn)
> [2019-03-18 14:40:14,100] INFO Unable to reconnect to ZooKeeper service, 
> session 0x1692e9ff4410004 has expired, closing socket connection 
> (org.apache.zookeeper.ClientCnxn)
> ---
>  
> The Spark job died with the following error:
> ERROR 2019-03-18 07:40:57,178 7924 org.apache.spark.executor.Executor 
> [Executor task launch worker for task 16] Exception in task 27.0 in stage 0.0 
> (TID 16)
> java.lang.AssertionError: assertion failed: Beginning offset 115204574 is 
> after the ending offset 115204516 for topic  partition 37. You 
> either provided an invalid fromOffset, or the Kafka topic has been damaged
>  at scala.Predef$.assert(Predef.scala:170)
>  at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:175)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
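
The assertion above is KafkaRDD's sanity check that each partition's starting offset does not lie beyond its ending offset. A minimal diagnostic sketch (placeholder topic and broker names; plain kafka-clients API only, no Spark code) that compares the offsets a job would start from with the broker-reported range, so an out-of-range checkpoint can be spotted before the batch is built:

{code:scala}
// Hedged diagnostic sketch, not a fix: prints each partition's valid offset range as
// reported by the brokers. A checkpointed fromOffset outside this range would trigger
// the "Beginning offset ... is after the ending offset ..." assertion in KafkaRDD.
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object CheckOffsetRanges {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder broker
    props.put("group.id", "offset-range-check")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    try {
      val partitions = consumer.partitionsFor("my-topic").asScala   // placeholder topic
        .map(p => new TopicPartition(p.topic, p.partition)).asJava
      val beginning = consumer.beginningOffsets(partitions).asScala
      val ending = consumer.endOffsets(partitions).asScala
      beginning.foreach { case (tp, begin) =>
        // A stored fromOffset below `begin` (log retention/compaction) or above the end
        // offset indicates the checkpointed offsets no longer match the topic.
        println(s"$tp valid range: [$begin, ${ending(tp)})")
      }
    } finally {
      consumer.close()
    }
  }
}
{code}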



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33315) simplify CaseWhen with EqualTo

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224453#comment-17224453
 ] 

Apache Spark commented on SPARK-33315:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30222

> simplify CaseWhen with EqualTo
> --
>
> Key: SPARK-33315
> URL: https://issues.apache.org/jira/browse/SPARK-33315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> create table t(a int, b int, c int) using parquet;
> SELECT * 
> FROM   (SELECT CASE 
>  WHEN a = 100 THEN 1 
>  WHEN b > 1000 THEN 2 
>  WHEN c IS NOT NULL THEN 3 
>END AS x 
> FROM   t) tmp 
> WHERE  x = 2
> {code}
> {noformat}
> Before this PR:
> Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN 
> isnotnull(c#3) THEN 3 END = 2)
> After this PR:
> Filter (b#2 > 1000)
> {noformat}
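
A small spark-shell sketch (assuming a default local SparkSession named {{spark}}) that reproduces the plans quoted above; with the proposed simplification the optimized Filter should reduce to the plain {{b > 1000}} predicate:

{code:scala}
// Create the table from the SQL above and inspect the plans. Whether the Filter node
// still contains the full CASE WHEN or collapses to (b > 1000) depends on the
// optimizer rule proposed in this ticket.
spark.sql("create table t(a int, b int, c int) using parquet")

val query = spark.sql("""
  SELECT *
  FROM   (SELECT CASE
                   WHEN a = 100 THEN 1
                   WHEN b > 1000 THEN 2
                   WHEN c IS NOT NULL THEN 3
                 END AS x
          FROM   t) tmp
  WHERE  x = 2
""")

// Prints parsed, analyzed, optimized and physical plans.
query.explain(true)
{code}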



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33315) simplify CaseWhen with EqualTo

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33315:


Assignee: (was: Apache Spark)

> simplify CaseWhen with EqualTo
> --
>
> Key: SPARK-33315
> URL: https://issues.apache.org/jira/browse/SPARK-33315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> create table t(a int, b int, c int) using parquet;
> SELECT * 
> FROM   (SELECT CASE 
>  WHEN a = 100 THEN 1 
>  WHEN b > 1000 THEN 2 
>  WHEN c IS NOT NULL THEN 3 
>END AS x 
> FROM   t) tmp 
> WHERE  x = 2
> {code}
> {noformat}
> Before this PR:
> Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN 
> isnotnull(c#3) THEN 3 END = 2)
> After this PR:
> Filter (b#2 > 1000)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33315) simplify CaseWhen with EqualTo

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224452#comment-17224452
 ] 

Apache Spark commented on SPARK-33315:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30222

> simplify CaseWhen with EqualTo
> --
>
> Key: SPARK-33315
> URL: https://issues.apache.org/jira/browse/SPARK-33315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> create table t(a int, b int, c int) using parquet;
> SELECT * 
> FROM   (SELECT CASE 
>  WHEN a = 100 THEN 1 
>  WHEN b > 1000 THEN 2 
>  WHEN c IS NOT NULL THEN 3 
>END AS x 
> FROM   t) tmp 
> WHERE  x = 2
> {code}
> {noformat}
> Before this PR:
> Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN 
> isnotnull(c#3) THEN 3 END = 2)
> After this PR:
> Filter (b#2 > 1000)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33315) simplify CaseWhen with EqualTo

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33315:


Assignee: Apache Spark

> simplify CaseWhen with EqualTo
> --
>
> Key: SPARK-33315
> URL: https://issues.apache.org/jira/browse/SPARK-33315
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {code:sql}
> create table t(a int, b int, c int) using parquet;
> SELECT * 
> FROM   (SELECT CASE 
>  WHEN a = 100 THEN 1 
>  WHEN b > 1000 THEN 2 
>  WHEN c IS NOT NULL THEN 3 
>END AS x 
> FROM   t) tmp 
> WHERE  x = 2
> {code}
> {noformat}
> Before this PR:
> Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN 
> isnotnull(c#3) THEN 3 END = 2)
> After this PR:
> Filter (b#2 > 1000)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33315) simplify CaseWhen with EqualTo

2020-11-01 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33315:
---

 Summary: simplify CaseWhen with EqualTo
 Key: SPARK-33315
 URL: https://issues.apache.org/jira/browse/SPARK-33315
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


{code:sql}
create table t(a int, b int, c int) using parquet;
SELECT * 
FROM   (SELECT CASE 
 WHEN a = 100 THEN 1 
 WHEN b > 1000 THEN 2 
 WHEN c IS NOT NULL THEN 3 
   END AS x 
FROM   t) tmp 
WHERE  x = 2
{code}


{noformat}
Before this PR:
Filter (CASE WHEN (a#1 = 100) THEN 1 WHEN (b#2 > 1000) THEN 2 WHEN 
isnotnull(c#3) THEN 3 END = 2)

After this PR:
Filter (b#2 > 1000)
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33259) Joining 3 streams results in incorrect output

2020-11-01 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224446#comment-17224446
 ] 

Jungtaek Lim commented on SPARK-33259:
--

I'll also link to the first JIRA issue where the problem was pointed out 
earlier.

> Joining 3 streams results in incorrect output
> -
>
> Key: SPARK-33259
> URL: https://issues.apache.org/jira/browse/SPARK-33259
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.1
>Reporter: Michael
>Priority: Critical
>  Labels: correctness
>
> I encountered an issue with Structured Streaming when doing a ((A LEFT JOIN 
> B) INNER JOIN C) operation. Below you can see example code I [posted on 
> Stackoverflow|https://stackoverflow.com/questions/64503539/]...
> I created a minimal example of "sessions", that have "start" and "end" events 
> and optionally some "metadata".
> The script generates two outputs: {{sessionStartsWithMetadata}} result from 
> "start" events that are left-joined with the "metadata" events, based on 
> {{sessionId}}. A "left join" is used, since we would like to get an output event 
> even when no corresponding metadata exists.
> Additionally, a DataFrame {{endedSessionsWithMetadata}} is created by joining 
> "end" events to the previously created DataFrame. Here an "inner join" is 
> used, since we only want output once a session has definitely ended.
> This code can be executed in {{spark-shell}}:
> {code:scala}
> import java.sql.Timestamp
> import org.apache.spark.sql.execution.streaming.{MemoryStream, 
> StreamingQueryWrapper}
> import org.apache.spark.sql.streaming.StreamingQuery
> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.functions.{col, expr, lit}
> import spark.implicits._
> implicit val sqlContext: SQLContext = spark.sqlContext
> // Main data processing, regardless whether batch or stream processing
> def process(
> sessionStartEvents: DataFrame,
> sessionOptionalMetadataEvents: DataFrame,
> sessionEndEvents: DataFrame
> ): (DataFrame, DataFrame) = {
>   val sessionStartsWithMetadata: DataFrame = sessionStartEvents
> .join(
>   sessionOptionalMetadataEvents,
>   sessionStartEvents("sessionId") === 
> sessionOptionalMetadataEvents("sessionId") &&
> sessionStartEvents("sessionStartTimestamp").between(
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL
>  1 seconds")),
>   
> sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL
>  1 seconds"))
> ),
>   "left" // metadata is optional
> )
> .select(
>   sessionStartEvents("sessionId"),
>   sessionStartEvents("sessionStartTimestamp"),
>   sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
> )
>   val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
> sessionEndEvents,
> sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") 
> &&
>   sessionStartsWithMetadata("sessionStartTimestamp").between(
> sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 
> seconds")),
> sessionEndEvents("sessionEndTimestamp")
>   )
>   )
>   (sessionStartsWithMetadata, endedSessionsWithMetadata)
> }
> def streamProcessing(
> sessionStartData: Seq[(Timestamp, Int)],
> sessionOptionalMetadata: Seq[(Timestamp, Int)],
> sessionEndData: Seq[(Timestamp, Int)]
> ): (StreamingQuery, StreamingQuery) = {
>   val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionStartEventsStream.addData(sessionStartData)
>   val sessionStartEvents: DataFrame = sessionStartEventsStream
> .toDS()
> .toDF("sessionStartTimestamp", "sessionId")
> .withWatermark("sessionStartTimestamp", "1 second")
>   val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
>   val sessionOptionalMetadataEvents: DataFrame = 
> sessionOptionalMetadataEventsStream
> .toDS()
> .toDF("sessionOptionalMetadataTimestamp", "sessionId")
> .withWatermark("sessionOptionalMetadataTimestamp", "1 second")
>   val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = 
> MemoryStream[(Timestamp, Int)]
>   sessionEndEventsStream.addData(sessionEndData)
>   val sessionEndEvents: DataFrame = sessionEndEventsStream
> .toDS()
> .toDF("sessionEndTimestamp", "sessionId")
> .withWatermark("sessionEndTimestamp", "1 second")
>   val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
> process(sessionStartEvents, sessionOptionalMetadataEvents, 
> sessionEndEvents)
>   val sessionStartsWithMet

[jira] [Commented] (SPARK-12567) Add aes_encrypt and aes_decrypt UDFs

2020-11-01 Thread Saheb Kanodia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224445#comment-17224445
 ] 

Saheb Kanodia commented on SPARK-12567:
---

[~wuhao] [~vectorijk] This seems like a very common feature in other 
systems and is in demand among enterprises. Can you please re-open this?

> Add aes_encrypt and aes_decrypt UDFs
> 
>
> Key: SPARK-12567
> URL: https://issues.apache.org/jira/browse/SPARK-12567
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>Priority: Major
>  Labels: bulk-closed
>
> AES (Advanced Encryption Standard) algorithm.
> Add aes_encrypt and aes_decrypt UDFs.
> Ref:
> [Hive|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions]
> [MySQL|https://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_aes-decrypt]
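
Until a built-in version exists, a rough user-level sketch (not the implementation proposed here) can register equivalent UDFs on top of {{javax.crypto}}; it assumes a spark-shell session named {{spark}}, uses AES/ECB/PKCS5Padding as Hive and MySQL do, and simplifies key handling to a literal 16-byte key:

{code:scala}
// Hedged sketch: user-registered aes_encrypt/aes_decrypt UDFs backed by javax.crypto.
// Assumes an AES-128 key of exactly 16 bytes; real key handling would need more care.
import javax.crypto.Cipher
import javax.crypto.spec.SecretKeySpec

def aes(mode: Int, key: Array[Byte], input: Array[Byte]): Array[Byte] = {
  val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
  cipher.init(mode, new SecretKeySpec(key, "AES"))
  cipher.doFinal(input)
}

spark.udf.register("aes_encrypt",
  (input: Array[Byte], key: Array[Byte]) => aes(Cipher.ENCRYPT_MODE, key, input))
spark.udf.register("aes_decrypt",
  (input: Array[Byte], key: Array[Byte]) => aes(Cipher.DECRYPT_MODE, key, input))

// Round-trip check: decrypting the encrypted value returns the original string.
spark.sql("""
  SELECT cast(aes_decrypt(
           aes_encrypt(cast('ABC' as binary), cast('1234567890123456' as binary)),
           cast('1234567890123456' as binary)) as string) AS roundtrip
""").show()
{code}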



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33027) Add DisableUnnecessaryBucketedScan rule to AQE queryStagePreparationRules

2020-11-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33027:
---

Assignee: Cheng Su

> Add DisableUnnecessaryBucketedScan rule to AQE queryStagePreparationRules
> -
>
> Key: SPARK-33027
> URL: https://issues.apache.org/jira/browse/SPARK-33027
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
>
> As a follow-up to the comment at 
> [https://github.com/apache/spark/pull/29804#issuecomment-700650620], we will 
> add the physical plan rule DisableUnnecessaryBucketedScan into AQE 
> AdaptiveSparkPlanExec.queryStagePreparationRules. This mostly needs some more 
> unit test crafting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33027) Add DisableUnnecessaryBucketedScan rule to AQE queryStagePreparationRules

2020-11-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33027.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30200
[https://github.com/apache/spark/pull/30200]

> Add DisableUnnecessaryBucketedScan rule to AQE queryStagePreparationRules
> -
>
> Key: SPARK-33027
> URL: https://issues.apache.org/jira/browse/SPARK-33027
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> As a follow-up to the comment at 
> [https://github.com/apache/spark/pull/29804#issuecomment-700650620], we will 
> add the physical plan rule DisableUnnecessaryBucketedScan into AQE 
> AdaptiveSparkPlanExec.queryStagePreparationRules. This mostly needs some more 
> unit test crafting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency

2020-11-01 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224435#comment-17224435
 ] 

Liya Fan commented on SPARK-33279:
--

[~bryanc] Thanks a lot for your effort. 

> Spark 3.0 failure due to lack of Arrow dependency
> -
>
> Key: SPARK-33279
> URL: https://issues.apache.org/jira/browse/SPARK-33279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> A recent change in Arrow has split the arrow-memory module into three, so client 
> code must add a dependency on arrow-memory-netty (or arrow-memory-unsafe).
> This has been done in the master branch of Spark, but not in branch-3.0, 
> which is causing the build in branch-3.0 to fail 
> (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0)
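
For reference, a hedged build-file sketch (sbt syntax; the Maven equivalent is analogous, and the version value is a placeholder) of the extra dependency client code now needs:

{code:scala}
// Since the arrow-memory split, arrow-vector alone no longer brings a memory allocator;
// one of the two implementations has to be added explicitly.
val arrowVersion = "2.0.0" // placeholder: match the Arrow version already on the classpath

libraryDependencies ++= Seq(
  "org.apache.arrow" % "arrow-vector" % arrowVersion,
  "org.apache.arrow" % "arrow-memory-netty" % arrowVersion // or "arrow-memory-unsafe"
)
{code}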



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency

2020-11-01 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224420#comment-17224420
 ] 

Bryan Cutler edited comment on SPARK-33279 at 11/2/20, 5:21 AM:


[~fan_li_ya] we should change the Arrow-Spark integration tests so that they 
don't try to build with the latest Arrow Java and instead just test the 
latest pyarrow, which should work. I made ARROW-10457 for this.


was (Author: bryanc):
[~fan_li_ya] we should change the Arrow-Spark integration tests so that they 
don't try to build with the latest Arrow Java and instead just test the 
latest pyarrow, which should work.

> Spark 3.0 failure due to lack of Arrow dependency
> -
>
> Key: SPARK-33279
> URL: https://issues.apache.org/jira/browse/SPARK-33279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> A recent change in Arrow has split the arrow-memory module into three, so client 
> code must add a dependency on arrow-memory-netty (or arrow-memory-unsafe).
> This has been done in the master branch of Spark, but not in branch-3.0, 
> which is causing the build in branch-3.0 to fail 
> (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33279) Spark 3.0 failure due to lack of Arrow dependency

2020-11-01 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224420#comment-17224420
 ] 

Bryan Cutler commented on SPARK-33279:
--

[~fan_li_ya] we should change the Arrow-Spark integration tests so that they 
don't try to build with the latest Arrow Java and instead just test the 
latest pyarrow, which should work.

> Spark 3.0 failure due to lack of Arrow dependency
> -
>
> Key: SPARK-33279
> URL: https://issues.apache.org/jira/browse/SPARK-33279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> A recent change in Arrow has split the arrow-memory module into three, so client 
> code must add a dependency on arrow-memory-netty (or arrow-memory-unsafe).
> This has been done in the master branch of Spark, but not in branch-3.0, 
> which is causing the build in branch-3.0 to fail 
> (https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33314) Avro reader drops rows

2020-11-01 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-33314:
-
Priority: Blocker  (was: Major)

> Avro reader drops rows
> --
>
> Key: SPARK-33314
> URL: https://issues.apache.org/jira/browse/SPARK-33314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Blocker
>  Labels: correctness
>
> Under certain circumstances, the V1 Avro reader drops rows. For example:
> {noformat}
> scala> val df = spark.range(0, 25).toDF("index")
> df: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> df.write.mode("overwrite").format("avro").save("index_avro")
> scala> val loaded = spark.read.format("avro").load("index_avro")
> loaded: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> loaded.collect.size
> res1: Int = 25
> scala> loaded.orderBy("index").collect.size
> res2: Int = 17   <== expected 25
> scala> 
> loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet")
> scala> spark.read.parquet("index_as_parquet").count
> res4: Long = 17
> scala>
> {noformat}
> SPARK-32346 slightly refactored the AvroFileFormat and 
> AvroPartitionReaderFactory to use a new iterator-like trait called 
> AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and 
> stores the deserialized row for the next call to RowReader#nextRow. 
> Unfortunately, sometimes hasNextRow is called twice before nextRow is called, 
> resulting in a lost row (see 
> [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132],
>  which calls records.hasNext once before calling it again 
> [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]).
> RowReader consumes the Avro record in hasNextRow, rather than nextRow, 
> because AvroDeserializer#deserialize potentially filters out the record.
> Two possible fixes that I thought of:
> 1) Keep state in RowReader so that multiple calls to RowReader#hasNextRow 
> with no intervening call to RowReader#nextRow consume at most one 
> Avro record. This requires no changes to any code that extends RowReader, 
> just RowReader itself.
>  2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow 
> could potentially return None) and wrap any iterator that extends RowReader 
> with a new iterator created by flatMap. This last iterator will filter out 
> the Nones and extract rows from the Somes. This requires changes to 
> AvroFileFormat and AvroPartitionReaderFactory as well as RowReader.
> The first one seems simplest and most straightforward, and doesn't require 
> changes to AvroFileFormat and AvroPartitionReaderFactory, only to 
> AvroUtils#RowReader. So I propose this.
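
A self-contained sketch of option (1), using hypothetical names rather than the actual AvroUtils#RowReader code: {{hasNextRow}} caches the deserialized row, so a second call without an intervening {{nextRow}} consumes no additional record:

{code:scala}
// Hedged sketch of fix (1): idempotent hasNextRow via a one-row cache.
// Names (CachingRowReader, records, deserialize) are illustrative, not Spark's.
trait CachingRowReader[T] {
  protected def records: Iterator[Array[Byte]]              // raw record source
  protected def deserialize(record: Array[Byte]): Option[T] // None = filtered out

  private var cached: Option[T] = None

  def hasNextRow: Boolean = {
    // Consume raw records only while nothing is cached; repeated calls are now safe.
    while (cached.isEmpty && records.hasNext) {
      cached = deserialize(records.next())
    }
    cached.isDefined
  }

  def nextRow(): T = {
    if (!hasNextRow) throw new NoSuchElementException("no more rows")
    val row = cached.get
    cached = None // hand the cached row out exactly once
    row
  }
}

object CachingRowReaderDemo extends App {
  val reader = new CachingRowReader[String] {
    protected val records = Iterator("a", "b", "c").map(_.getBytes("UTF-8"))
    protected def deserialize(r: Array[Byte]) = Some(new String(r, "UTF-8"))
  }
  // Calling hasNextRow twice (as BypassMergeSortShuffleWriter effectively does) loses no rows.
  assert(reader.hasNextRow && reader.hasNextRow)
  var rows = List.empty[String]
  while (reader.hasNextRow) rows :+= reader.nextRow()
  println(rows) // List(a, b, c)
}
{code}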



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33313:
-
Fix Version/s: 3.0.2
   2.4.8

> R/run-tests.sh is not compatible with testthat >= 3.0
> -
>
> Key: SPARK-33313
> URL: https://issues.apache.org/jira/browse/SPARK-33313
> Project: Spark
>  Issue Type: Bug
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Currently we use {{testthat:::test_package_dir}} to run full SparkR tests 
> with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
> (https://github.com/r-lib/testthat/pull/1054).
> Because of that AppVeyor tests fail with
> {code:r}
> Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
>   object 'test_package_dir' not found
> Calls: ::: -> get
> Execution halted
> {code}
> It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 
> 3.0, supports a {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2020-11-01 Thread Arya Ketan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224379#comment-17224379
 ] 

Arya Ketan commented on SPARK-24432:


With the acceptance of https://issues.apache.org/jira/browse/SPARK-30602, how is 
the design for this evolving?

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.
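
For readers arriving here later: a hedged configuration sketch, separate from this ticket's external-shuffle-service design, showing dynamic allocation on Kubernetes via shuffle tracking (available since Spark 3.0):

{code:scala}
// Hedged sketch: standard dynamic-allocation settings plus shuffle tracking, which
// avoids the need for an external shuffle service. Executor bounds are arbitrary
// example values; the same keys can be passed to spark-submit with --conf.
import org.apache.spark.SparkConf

val dynamicAllocationConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
{code}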



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33273) Fix Flaky Test: ThriftServerQueryTestSuite. subquery_scalar_subquery_scalar_subquery_select_sql

2020-11-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224377#comment-17224377
 ] 

Hyukjin Kwon commented on SPARK-33273:
--

cc [~yumwang] do you have any clue?

> Fix Flaky Test: ThriftServerQueryTestSuite. 
> subquery_scalar_subquery_scalar_subquery_select_sql
> ---
>
> Key: SPARK-33273
> URL: https://issues.apache.org/jira/browse/SPARK-33273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>  Labels: correctness
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130369/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/subquery_scalar_subquery_scalar_subquery_select_sql/
> {code}
> [info] - subquery/scalar-subquery/scalar-subquery-select.sql *** FAILED *** 
> (3 seconds, 877 milliseconds)
> [info]   Expected "[1]0   2017-05-04 01:01:0...", but got "[]0
> 2017-05-04 01:01:0..." Result did not match for query #3
> [info]   SELECT (SELECT min(t3d) FROM t3) min_t3d,
> [info]  (SELECT max(t2h) FROM t2) max_t2h
> [info]   FROM   t1
> [info]   WHERE  t1a = 'val1c' (ThriftServerQueryTestSuite.scala:197)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33314) Avro reader drops rows

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33314:


Assignee: (was: Apache Spark)

> Avro reader drops rows
> --
>
> Key: SPARK-33314
> URL: https://issues.apache.org/jira/browse/SPARK-33314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Under certain circumstances, the V1 Avro reader drops rows. For example:
> {noformat}
> scala> val df = spark.range(0, 25).toDF("index")
> df: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> df.write.mode("overwrite").format("avro").save("index_avro")
> scala> val loaded = spark.read.format("avro").load("index_avro")
> loaded: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> loaded.collect.size
> res1: Int = 25
> scala> loaded.orderBy("index").collect.size
> res2: Int = 17   <== expected 25
> scala> 
> loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet")
> scala> spark.read.parquet("index_as_parquet").count
> res4: Long = 17
> scala>
> {noformat}
> SPARK-32346 slightly refactored the AvroFileFormat and 
> AvroPartitionReaderFactory to use a new iterator-like trait called 
> AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and 
> stores the deserialized row for the next call to RowReader#nextRow. 
> Unfortunately, sometimes hasNextRow is called twice before nextRow is called, 
> resulting in a lost row (see 
> [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132],
>  which calls records.hasNext once before calling it again 
> [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]).
> RowReader consumes the Avro record in hasNextRow, rather than nextRow, 
> because AvroDeserializer#deserialize potentially filters out the record.
> Two possible fixes that I thought of:
> 1) Keep state in RowReader so that multiple calls to RowReader#hasNextRow 
> with no intervening call to RowReader#nextRow consume at most one 
> Avro record. This requires no changes to any code that extends RowReader, 
> just RowReader itself.
>  2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow 
> could potentially return None) and wrap any iterator that extends RowReader 
> with a new iterator created by flatMap. This last iterator will filter out 
> the Nones and extract rows from the Somes. This requires changes to 
> AvroFileFormat and AvroPartitionReaderFactory as well as RowReader.
> The first one seems simplest and most straightforward, and doesn't require 
> changes to AvroFileFormat and AvroPartitionReaderFactory, only to 
> AvroUtils#RowReader. So I propose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33314) Avro reader drops rows

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33314:


Assignee: Apache Spark

> Avro reader drops rows
> --
>
> Key: SPARK-33314
> URL: https://issues.apache.org/jira/browse/SPARK-33314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> Under certain circumstances, the V1 Avro reader drops rows. For example:
> {noformat}
> scala> val df = spark.range(0, 25).toDF("index")
> df: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> df.write.mode("overwrite").format("avro").save("index_avro")
> scala> val loaded = spark.read.format("avro").load("index_avro")
> loaded: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> loaded.collect.size
> res1: Int = 25
> scala> loaded.orderBy("index").collect.size
> res2: Int = 17   <== expected 25
> scala> 
> loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet")
> scala> spark.read.parquet("index_as_parquet").count
> res4: Long = 17
> scala>
> {noformat}
> SPARK-32346 slightly refactored the AvroFileFormat and 
> AvroPartitionReaderFactory to use a new iterator-like trait called 
> AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and 
> stores the deserialized row for the next call to RowReader#nextRow. 
> Unfortunately, sometimes hasNextRow is called twice before nextRow is called, 
> resulting in a lost row (see 
> [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132],
>  which calls records.hasNext once before calling it again 
> [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]).
> RowReader consumes the Avro record in hasNextRow, rather than nextRow, 
> because AvroDeserializer#deserialize potentially filters out the record.
> Two possible fixes that I thought of:
> 1) Keep state in RowReader so that multiple calls to RowReader#hasNextRow 
> with no intervening call to RowReader#nextRow consume at most one 
> Avro record. This requires no changes to any code that extends RowReader, 
> just RowReader itself.
>  2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow 
> could potentially return None) and wrap any iterator that extends RowReader 
> with a new iterator created by flatMap. This last iterator will filter out 
> the Nones and extract rows from the Somes. This requires changes to 
> AvroFileFormat and AvroPartitionReaderFactory as well as RowReader.
> The first one seems simplest and most straightforward, and doesn't require 
> changes to AvroFileFormat and AvroPartitionReaderFactory, only to 
> AvroUtils#RowReader. So I propose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33314) Avro reader drops rows

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224367#comment-17224367
 ] 

Apache Spark commented on SPARK-33314:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/30221

> Avro reader drops rows
> --
>
> Key: SPARK-33314
> URL: https://issues.apache.org/jira/browse/SPARK-33314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Under certain circumstances, the V1 Avro reader drops rows. For example:
> {noformat}
> scala> val df = spark.range(0, 25).toDF("index")
> df: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> df.write.mode("overwrite").format("avro").save("index_avro")
> scala> val loaded = spark.read.format("avro").load("index_avro")
> loaded: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> loaded.collect.size
> res1: Int = 25
> scala> loaded.orderBy("index").collect.size
> res2: Int = 17   <== expected 25
> scala> 
> loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet")
> scala> spark.read.parquet("index_as_parquet").count
> res4: Long = 17
> scala>
> {noformat}
> SPARK-32346 slightly refactored the AvroFileFormat and 
> AvroPartitionReaderFactory to use a new iterator-like trait called 
> AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and 
> stores the deserialized row for the next call to RowReader#nextRow. 
> Unfortunately, sometimes hasNextRow is called twice before nextRow is called, 
> resulting in a lost row (see 
> [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132],
>  which calls records.hasNext once before calling it again 
> [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]).
> RowReader consumes the Avro record in hasNextRow, rather than nextRow, 
> because AvroDeserializer#deserialize potentially filters out the record.
> Two possible fixes that I thought of:
> 1) Keep state in RowReader so that multiple calls to RowReader#hasNextRow 
> with no intervening call to RowReader#nextRow consume at most one 
> Avro record. This requires no changes to any code that extends RowReader, 
> just RowReader itself.
>  2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow 
> could potentially return None) and wrap any iterator that extends RowReader 
> with a new iterator created by flatMap. This last iterator will filter out 
> the Nones and extract rows from the Somes. This requires changes to 
> AvroFileFormat and AvroPartitionReaderFactory as well as RowReader.
> The first one seems simplest and most straightforward, and doesn't require 
> changes to AvroFileFormat and AvroPartitionReaderFactory, only to 
> AvroUtils#RowReader. So I propose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33282) Replace Probot Autolabeler with Github Action

2020-11-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224366#comment-17224366
 ] 

Hyukjin Kwon commented on SPARK-33282:
--

Yeah, I noticed this but couldn't find the time to take a look. Thanks for working 
on this.
cc [~dongjoon] as well FYI.

> Replace Probot Autolabeler with Github Action
> -
>
> Key: SPARK-33282
> URL: https://issues.apache.org/jira/browse/SPARK-33282
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.0.1
>Reporter: Kyle Bendickson
>Priority: Major
>
> The Probot Autolabeler that we were using in both the Iceberg and the Spark 
> repo is no longer working. I've confirmed that with the developer, GitHub user 
> [at]mithro, who has indicated that the Probot Autolabeler is end of life and 
> will not be maintained moving forward.
> PRs have not been labeled for a few weeks now.
>  
> As I'm already interfacing with ASF Infra to have the probot permissions 
> revoked from the Iceberg repo, and I've already submitted a patch to switch 
> Iceberg to the standard github labeler action, I figured I would go ahead and 
> volunteer myself to switch the Spark repo as well.
> I will have a patch to switch to the new github labeler open within a few 
> days.
>  
> Also thank you [~blue] (or [~holden]) for shepherding this! I didn't exactly 
> ask, but it was understood in our group meeting for Iceberg that I'd be 
> converting our labeler there so I figured I'd tackle the spark issue while 
> I'm getting my hands into the labeling configs anyway =)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33289) Error executing `SHOW VIEWS in {database}` using JDBC

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33289.
--
Resolution: Invalid

> Error executing `SHOW VIEWS in {database}` using JDBC
> -
>
> Key: SPARK-33289
> URL: https://issues.apache.org/jira/browse/SPARK-33289
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Prakshal Jain
>Priority: Major
>
> Full description: 
> [https://stackoverflow.com/questions/60742577/spark-sql-query-show-views-in-through-hive-metastore-fails-with-missing-func]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33289) Error executing `SHOW VIEWS in {database}` using JDBC

2020-11-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224365#comment-17224365
 ] 

Hyukjin Kwon commented on SPARK-33289:
--

This seems to be an issue in Tableau that generates the query incorrectly. Also, please 
provide a reproducer next time.

> Error executing `SHOW VIEWS in {database}` using JDBC
> -
>
> Key: SPARK-33289
> URL: https://issues.apache.org/jira/browse/SPARK-33289
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Prakshal Jain
>Priority: Major
>
> Full description: 
> [https://stackoverflow.com/questions/60742577/spark-sql-query-show-views-in-through-hive-metastore-fails-with-missing-func]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33312) Provide latest Spark 2.4.7 runnable distribution

2020-11-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224364#comment-17224364
 ] 

Hyukjin Kwon commented on SPARK-33312:
--

Providing that option adds quite a bit of overhead for updating the site - it 
is a static site that needs a manual update.
Can you use snapshots instead? 
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/
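
For example, a hedged sbt sketch for consuming those snapshot artifacts (the exact {{-SNAPSHOT}} version string is a placeholder; take it from the repository listing):

{code:scala}
// sbt sketch: resolve Spark snapshot builds from the ASF snapshots repository.
resolvers += "apache-snapshots" at "https://repository.apache.org/content/groups/snapshots/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8-SNAPSHOT" // placeholder version
{code}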

> Provide latest Spark 2.4.7 runnable distribution
> 
>
> Key: SPARK-33312
> URL: https://issues.apache.org/jira/browse/SPARK-33312
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.4.7
>Reporter: Prateek Dubey
>Priority: Major
>
> Not sure if this is the right approach; however, it would be great if the latest 
> Spark 2.4.7 runnable distribution could be provided here - 
> [https://spark.apache.org/downloads.html]
> Currently it seems the last build was done on Sept 12th, 2020. 
> I'm working on running Spark workloads on EKS using EKS IRSA. I'm able to run 
> Spark workloads on EKS using IRSA with Spark 3.0/ Hadoop 3.2, however I want 
> to do the same with Spark 2.4.7/ Hadoop 2.7. 
> Recently this PR was merged with 2.4.x - 
> [https://github.com/apache/spark/pull/29877] and therefore I'm in need of 
> latest Spark distribution 
>  
> PS: I tried building the latest Spark 2.4.7 myself using Maven as well; however, 
> there are too many errors every time it reaches R, so it would be 
> great if the Spark community itself could provide the latest build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33309) Replace origin GROUPING SETS with new expression

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33309:
-
Description: 
Replace current GroupingSets just using expression

{code}
case class GroupingSets(
selectedGroupByExprs: Seq[Seq[Expression]],
groupByExprs: Seq[Expression],
child: LogicalPlan,
aggregations: Seq[NamedExpression]) extends UnaryNode {

  override def output: Seq[Attribute] = aggregations.map(_.toAttribute)

  // Needs to be unresolved before its translated to Aggregate + Expand because 
output attributes
  // will change in analysis.
  override lazy val resolved: Boolean = false
}
{code}


  was:
Replace current GroupingSets just using expression
```
case class GroupingSets(
selectedGroupByExprs: Seq[Seq[Expression]],
groupByExprs: Seq[Expression],
child: LogicalPlan,
aggregations: Seq[NamedExpression]) extends UnaryNode {

  override def output: Seq[Attribute] = aggregations.map(_.toAttribute)

  // Needs to be unresolved before its translated to Aggregate + Expand because 
output attributes
  // will change in analysis.
  override lazy val resolved: Boolean = false
}
```


> Replace origin GROUPING SETS with new expression
> 
>
> Key: SPARK-33309
> URL: https://issues.apache.org/jira/browse/SPARK-33309
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Replace current GroupingSets just using expression
> {code}
> case class GroupingSets(
> selectedGroupByExprs: Seq[Seq[Expression]],
> groupByExprs: Seq[Expression],
> child: LogicalPlan,
> aggregations: Seq[NamedExpression]) extends UnaryNode {
>   override def output: Seq[Attribute] = aggregations.map(_.toAttribute)
>   // Needs to be unresolved before its translated to Aggregate + Expand 
> because output attributes
>   // will change in analysis.
>   override lazy val resolved: Boolean = false
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33312) Provide latest Spark 2.4.7 runnable distribution

2020-11-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224363#comment-17224363
 ] 

Hyukjin Kwon commented on SPARK-33312:
--

Providing that option adds quite a bit of overhead for updating the site - it 
is a static site that needs a manual update.
Can you use snapshots instead? 
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/

> Provide latest Spark 2.4.7 runnable distribution
> 
>
> Key: SPARK-33312
> URL: https://issues.apache.org/jira/browse/SPARK-33312
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.4.7
>Reporter: Prateek Dubey
>Priority: Major
>
> Not sure if this is the right approach; however, it would be great if the latest 
> Spark 2.4.7 runnable distribution could be provided here - 
> [https://spark.apache.org/downloads.html]
> Currently it seems the last build was done on Sept 12th, 2020. 
> I'm working on running Spark workloads on EKS using EKS IRSA. I'm able to run 
> Spark workloads on EKS using IRSA with Spark 3.0/ Hadoop 3.2, however I want 
> to do the same with Spark 2.4.7/ Hadoop 2.7. 
> Recently this PR was merged with 2.4.x - 
> [https://github.com/apache/spark/pull/29877] and therefore I'm in need of 
> latest Spark distribution 
>  
> PS: I tried building the latest Spark 2.4.7 myself using Maven as well; however, 
> there are too many errors every time it reaches R, so it would be 
> great if the Spark community itself could provide the latest build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33314) Avro reader drops rows

2020-11-01 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-33314:
--
Labels: correctness  (was: )

> Avro reader drops rows
> --
>
> Key: SPARK-33314
> URL: https://issues.apache.org/jira/browse/SPARK-33314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: correctness
>
> Under certain circumstances, the V1 Avro reader drops rows. For example:
> {noformat}
> scala> val df = spark.range(0, 25).toDF("index")
> df: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> df.write.mode("overwrite").format("avro").save("index_avro")
> scala> val loaded = spark.read.format("avro").load("index_avro")
> loaded: org.apache.spark.sql.DataFrame = [index: bigint]
> scala> loaded.collect.size
> res1: Int = 25
> scala> loaded.orderBy("index").collect.size
> res2: Int = 17   <== expected 25
> scala> 
> loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet")
> scala> spark.read.parquet("index_as_parquet").count
> res4: Long = 17
> scala>
> {noformat}
> SPARK-32346 slightly refactored the AvroFileFormat and 
> AvroPartitionReaderFactory to use a new iterator-like trait called 
> AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and 
> stores the deserialized row for the next call to RowReader#nextRow. 
> Unfortunately, sometimes hasNextRow is called twice before nextRow is called, 
> resulting in a lost row (see 
> [BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132],
>  which calls records.hasNext once before calling it again 
> [here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]).
> RowReader consumes the Avro record in hasNextRow, rather than nextRow, 
> because AvroDeserializer#deserialize potentially filters out the record.
> Two possible fixes that I thought of:
> 1) Keep state in RowReader so that multiple calls to RowReader#hasNextRow 
> with no intervening call to RowReader#nextRow consume at most one 
> Avro record. This requires no changes to any code that extends RowReader, 
> just RowReader itself.
>  2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow 
> could potentially return None) and wrap any iterator that extends RowReader 
> with a new iterator created by flatMap. This last iterator will filter out 
> the Nones and extract rows from the Somes. This requires changes to 
> AvroFileFormat and AvroPartitionReaderFactory as well as RowReader.
> The first one seems simplest and most straightforward, and doesn't require 
> changes to AvroFileFormat and AvroPartitionReaderFactory, only to 
> AvroUtils#RowReader. So I propose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33314) Avro reader drops rows

2020-11-01 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-33314:
-

 Summary: Avro reader drops rows
 Key: SPARK-33314
 URL: https://issues.apache.org/jira/browse/SPARK-33314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Bruce Robbins


Under certain circumstances, the V1 Avro reader drops rows. For example:
{noformat}
scala> val df = spark.range(0, 25).toDF("index")
df: org.apache.spark.sql.DataFrame = [index: bigint]

scala> df.write.mode("overwrite").format("avro").save("index_avro")

scala> val loaded = spark.read.format("avro").load("index_avro")
loaded: org.apache.spark.sql.DataFrame = [index: bigint]

scala> loaded.collect.size
res1: Int = 25

scala> loaded.orderBy("index").collect.size
res2: Int = 17   <== expected 25

scala> 
loaded.orderBy("index").write.mode("overwrite").format("parquet").save("index_as_parquet")

scala> spark.read.parquet("index_as_parquet").count
res4: Long = 17

scala>
{noformat}
SPARK-32346 slightly refactored the AvroFileFormat and 
AvroPartitionReaderFactory to use a new iterator-like trait called 
AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and 
stores the deserialized row for the next call to RowReader#nextRow. 
Unfortunately, sometimes hasNextRow is called twice before nextRow is called, 
resulting in a lost row (see 
[BypassMergeSortShuffleWriter#write|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L132],
 which calls records.hasNext once before calling it again 
[here|https://github.com/apache/spark/blob/69c27f49acf2fe6fbc8335bde2aac4afd4188678/core/src/main/java/org/apache/spark/shuffle/sort/BypassMergeSortShuffleWriter.java#L155]).

RowReader consumes the Avro record in hasNextRow, rather than nextRow, because 
AvroDeserializer#deserialize potentially filters out the record.

Two possible fixes that I thought of:

1) Keep state in RowReader so that multiple calls to RowReader#hasNextRow 
with no intervening call to RowReader#nextRow consume at most one Avro 
record. This requires no changes to any code that extends RowReader, just 
RowReader itself.
 2) Move record consumption to RowReader#nextRow (such that RowReader#nextRow 
could potentially return None) and wrap any iterator that extends RowReader 
with a new iterator created by flatMap. This last iterator will filter out the 
Nones and extract rows from the Somes. This requires changes to AvroFileFormat 
and AvroPartitionReaderFactory as well as RowReader.

The first one seems simplest and most straightforward, and doesn't require 
changes to AvroFileFormat and AvroPartitionReaderFactory, only to 
AvroUtils#RowReader. So I propose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33277:
-
Fix Version/s: 3.0.2
   2.4.8

> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> 
>
> Key: SPARK-33277
> URL: https://issues.apache.org/jira/browse/SPARK-33277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> E.g.,:
> {code:python}
> spark.range(0, 10, 1, 1).write.parquet(path)
> spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
> def f(x):
> return 0
> fUdf = udf(f, LongType())
> spark.read.parquet(path).select(fUdf('id')).head()
> {code}
> This is because the Python evaluation consumes the parent iterator in a 
> separate thread and it consumes more data from the parent even after the task 
> ends and the parent is closed. If an off-heap column vector exists in the 
> parent iterator, it could cause segmentation fault which crashes the executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33277:


Assignee: Takuya Ueshin

> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> 
>
> Key: SPARK-33277
> URL: https://issues.apache.org/jira/browse/SPARK-33277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.1.0
>
>
> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> E.g.,:
> {code:java}
> spark.range(0, 10, 1, 1).write.parquet(path)
> spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
> def f(x):
> return 0
> fUdf = udf(f, LongType())
> spark.read.parquet(path).select(fUdf('id')).head()
> {code}
> This is because, the Python evaluation consumes the parent iterator in a 
> separate thread and it consumes more data from the parent even after the task 
> ends and the parent is closed. If an off-heap column vector exists in the 
> parent iterator, it could cause segmentation fault which crashes the executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224362#comment-17224362
 ] 

Apache Spark commented on SPARK-33313:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30220

> R/run-tests.sh is not compatible with testthat >= 3.0
> -
>
> Key: SPARK-33313
> URL: https://issues.apache.org/jira/browse/SPARK-33313
> Project: Spark
>  Issue Type: Bug
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently we use {{testthat:::test_package_dir}} to run full SparkR tests 
> with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
> (https://github.com/r-lib/testthat/pull/1054).
> Because of that AppVeyor tests fail with
> {code:r}
> Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
>   object 'test_package_dir' not found
> Calls: ::: -> get
> Execution halted
> {code}
> It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 
> 3.0, supports {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224361#comment-17224361
 ] 

Apache Spark commented on SPARK-33313:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30220

> R/run-tests.sh is not compatible with testthat >= 3.0
> -
>
> Key: SPARK-33313
> URL: https://issues.apache.org/jira/browse/SPARK-33313
> Project: Spark
>  Issue Type: Bug
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently we use {{testthat:::test_package_dir}} to run full SparkR tests 
> with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
> (https://github.com/r-lib/testthat/pull/1054).
> Because of that AppVeyor tests fail with
> {code:r}
> Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
>   object 'test_package_dir' not found
> Calls: ::: -> get
> Execution halted
> {code}
> It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 
> 3.0, supports {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33313.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30219
[https://github.com/apache/spark/pull/30219]

> R/run-tests.sh is not compatible with testthat >= 3.0
> -
>
> Key: SPARK-33313
> URL: https://issues.apache.org/jira/browse/SPARK-33313
> Project: Spark
>  Issue Type: Bug
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently we use {{testthat:::test_package_dir}} to run full SparkR tests 
> with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
> (https://github.com/r-lib/testthat/pull/1054).
> Because of that AppVeyor tests fail with
> {code:r}
> Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
>   object 'test_package_dir' not found
> Calls: ::: -> get
> Execution halted
> {code}
> It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 
> 3.0, supports {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33313:


Assignee: Maciej Szymkiewicz

> R/run-tests.sh is not compatible with testthat >= 3.0
> -
>
> Key: SPARK-33313
> URL: https://issues.apache.org/jira/browse/SPARK-33313
> Project: Spark
>  Issue Type: Bug
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Currently we use {{testthat:::test_package_dir}} to run full SparkR tests 
> with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
> (https://github.com/r-lib/testthat/pull/1054).
> Because of that AppVeyor tests fail with
> {code:r}
> Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
>   object 'test_package_dir' not found
> Calls: ::: -> get
> Execution halted
> {code}
> It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 
> 3.0, supports {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30663:


Assignee: Maciej Szymkiewicz

> Remove 1.x testthat switch once Jenkins version is updated to 2.x
> -
>
> Key: SPARK-30663
> URL: https://issues.apache.org/jira/browse/SPARK-30663
> Project: Spark
>  Issue Type: Planned Work
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
>
> As part of the SPARK-23435 proposal we included a {{testthat}} 1.x 
> compatibility mode
> {code}
> if (grepl("^1\\..*", packageVersion("testthat"))) {
>   # testthat 1.x
>   test_runner <- testthat:::run_tests
>   reporter <- "summary"
> } else {
>   # testthat >= 2.0.0
>   test_runner <- testthat:::test_package_dir
>   reporter <- testthat::default_reporter()
> }
> {code}
> in {{R/pkg/tests/run-all.R}}.
> It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30663.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30219
[https://github.com/apache/spark/pull/30219]

> Remove 1.x testthat switch once Jenkins version is updated to 2.x
> -
>
> Key: SPARK-30663
> URL: https://issues.apache.org/jira/browse/SPARK-30663
> Project: Spark
>  Issue Type: Planned Work
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.1.0
>
>
> As part of the SPARK-23435 proposal we included a {{testthat}} 1.x 
> compatibility mode
> {code}
> if (grepl("^1\\..*", packageVersion("testthat"))) {
>   # testthat 1.x
>   test_runner <- testthat:::run_tests
>   reporter <- "summary"
> } else {
>   # testthat >= 2.0.0
>   test_runner <- testthat:::test_package_dir
>   reporter <- testthat::default_reporter()
> }
> {code}
> in {{R/pkg/tests/run-all.R}}.
> It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224349#comment-17224349
 ] 

Apache Spark commented on SPARK-33313:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30219

> R/run-tests.sh is not compatible with testthat >= 3.0
> -
>
> Key: SPARK-33313
> URL: https://issues.apache.org/jira/browse/SPARK-33313
> Project: Spark
>  Issue Type: Bug
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently we use {{testthat:::test_package_dir}} to run full SparkR tests 
> with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
> (https://github.com/r-lib/testthat/pull/1054).
> Because of that AppVeyor tests fail with
> {code:r}
> Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
>   object 'test_package_dir' not found
> Calls: ::: -> get
> Execution halted
> {code}
> It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 
> 3.0, supports {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33313:


Assignee: Apache Spark

> R/run-tests.sh is not compatible with testthat >= 3.0
> -
>
> Key: SPARK-33313
> URL: https://issues.apache.org/jira/browse/SPARK-33313
> Project: Spark
>  Issue Type: Bug
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Currently we use {{testthat:::test_package_dir}} to run full SparkR tests 
> with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
> (https://github.com/r-lib/testthat/pull/1054).
> Because of that AppVeyor tests fail with
> {code:r}
> Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
>   object 'test_package_dir' not found
> Calls: ::: -> get
> Execution halted
> {code}
> It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 
> 3.0, supports {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30663:


Assignee: (was: Apache Spark)

> Remove 1.x testthat switch once Jenkins version is updated to 2.x
> -
>
> Key: SPARK-30663
> URL: https://issues.apache.org/jira/browse/SPARK-30663
> Project: Spark
>  Issue Type: Planned Work
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> As part of the SPARK-23435 proposal we included a {{testthat}} 1.x 
> compatibility mode
> {code}
> if (grepl("^1\\..*", packageVersion("testthat"))) {
>   # testthat 1.x
>   test_runner <- testthat:::run_tests
>   reporter <- "summary"
> } else {
>   # testthat >= 2.0.0
>   test_runner <- testthat:::test_package_dir
>   reporter <- testthat::default_reporter()
> }
> {code}
> in {{R/pkg/tests/run-all.R}}.
> It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224348#comment-17224348
 ] 

Apache Spark commented on SPARK-33313:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30219

> R/run-tests.sh is not compatible with testthat >= 3.0
> -
>
> Key: SPARK-33313
> URL: https://issues.apache.org/jira/browse/SPARK-33313
> Project: Spark
>  Issue Type: Bug
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently we use {{testthat:::test_package_dir}} to run full SparkR tests 
> with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
> (https://github.com/r-lib/testthat/pull/1054).
> Because of that AppVeyor tests fail with
> {code:r}
> Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
>   object 'test_package_dir' not found
> Calls: ::: -> get
> Execution halted
> {code}
> It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 
> 3.0, supports {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33313:


Assignee: (was: Apache Spark)

> R/run-tests.sh is not compatible with testthat >= 3.0
> -
>
> Key: SPARK-33313
> URL: https://issues.apache.org/jira/browse/SPARK-33313
> Project: Spark
>  Issue Type: Bug
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Currently we use {{testthat:::test_package_dir}} to run full SparkR tests 
> with {{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
> (https://github.com/r-lib/testthat/pull/1054).
> Because of that AppVeyor tests fail with
> {code:r}
> Spark package found in SPARK_HOME: C:\projects\spark\bin\..
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
>   object 'test_package_dir' not found
> Calls: ::: -> get
> Execution halted
> {code}
> It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 
> 3.0, supports {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224347#comment-17224347
 ] 

Apache Spark commented on SPARK-30663:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30219

> Remove 1.x testthat switch once Jenkins version is updated to 2.x
> -
>
> Key: SPARK-30663
> URL: https://issues.apache.org/jira/browse/SPARK-30663
> Project: Spark
>  Issue Type: Planned Work
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> As part of the SPARK-23435 proposal we included a {{testthat}} 1.x 
> compatibility mode
> {code}
> if (grepl("^1\\..*", packageVersion("testthat"))) {
>   # testthat 1.x
>   test_runner <- testthat:::run_tests
>   reporter <- "summary"
> } else {
>   # testthat >= 2.0.0
>   test_runner <- testthat:::test_package_dir
>   reporter <- testthat::default_reporter()
> }
> {code}
> in {{R/pkg/tests/run-all.R}}.
> It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30663) Remove 1.x testthat switch once Jenkins version is updated to 2.x

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-30663:


Assignee: Apache Spark

> Remove 1.x testthat switch once Jenkins version is updated to 2.x
> -
>
> Key: SPARK-30663
> URL: https://issues.apache.org/jira/browse/SPARK-30663
> Project: Spark
>  Issue Type: Planned Work
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> As part of the SPARK-23435 proposal we included a {{testthat}} 1.x 
> compatibility mode
> {code}
> if (grepl("^1\\..*", packageVersion("testthat"))) {
>   # testthat 1.x
>   test_runner <- testthat:::run_tests
>   reporter <- "summary"
> } else {
>   # testthat >= 2.0.0
>   test_runner <- testthat:::test_package_dir
>   reporter <- testthat::default_reporter()
> }
> {code}
> in {{R/pkg/tests/run-all.R}}.
> It should be removed once the whole infrastructure uses {{testthat}} 2.x or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33313) R/run-tests.sh is not compatible with testthat >= 3.0

2020-11-01 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-33313:
--

 Summary: R/run-tests.sh is not compatible with testthat >= 3.0
 Key: SPARK-33313
 URL: https://issues.apache.org/jira/browse/SPARK-33313
 Project: Spark
  Issue Type: Bug
  Components: R, Tests
Affects Versions: 3.0.0, 3.1.0
Reporter: Maciej Szymkiewicz


Currently we use {{testthat:::test_package_dir}} to run full SparkR tests with 
{{testthat > 1.0}}. However, it has been dropped in {{testthat}} 3.0 
(https://github.com/r-lib/testthat/pull/1054).

Because of that AppVeyor tests fail with

{code:r}
Spark package found in SPARK_HOME: C:\projects\spark\bin\..
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : 
  object 'test_package_dir' not found
Calls: ::: -> get
Execution halted
{code}

It seems like we can use {{testthat::test_dir}} which, since {{testthat}} 3.0, 
supports {{package}} parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224339#comment-17224339
 ] 

Apache Spark commented on SPARK-33277:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/30218

> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> 
>
> Key: SPARK-33277
> URL: https://issues.apache.org/jira/browse/SPARK-33277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Takuya Ueshin
>Priority: Major
> Fix For: 3.1.0
>
>
> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> E.g.,:
> {code:java}
> spark.range(0, 10, 1, 1).write.parquet(path)
> spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
> def f(x):
> return 0
> fUdf = udf(f, LongType())
> spark.read.parquet(path).select(fUdf('id')).head()
> {code}
> This is because, the Python evaluation consumes the parent iterator in a 
> separate thread and it consumes more data from the parent even after the task 
> ends and the parent is closed. If an off-heap column vector exists in the 
> parent iterator, it could cause segmentation fault which crashes the executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224338#comment-17224338
 ] 

Apache Spark commented on SPARK-33277:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/30218

> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> 
>
> Key: SPARK-33277
> URL: https://issues.apache.org/jira/browse/SPARK-33277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Takuya Ueshin
>Priority: Major
> Fix For: 3.1.0
>
>
> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> E.g.,:
> {code:java}
> spark.range(0, 10, 1, 1).write.parquet(path)
> spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
> def f(x):
> return 0
> fUdf = udf(f, LongType())
> spark.read.parquet(path).select(fUdf('id')).head()
> {code}
> This is because, the Python evaluation consumes the parent iterator in a 
> separate thread and it consumes more data from the parent even after the task 
> ends and the parent is closed. If an off-heap column vector exists in the 
> parent iterator, it could cause segmentation fault which crashes the executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224329#comment-17224329
 ] 

Apache Spark commented on SPARK-33277:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/30217

> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> 
>
> Key: SPARK-33277
> URL: https://issues.apache.org/jira/browse/SPARK-33277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Takuya Ueshin
>Priority: Major
> Fix For: 3.1.0
>
>
> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> E.g.,:
> {code:java}
> spark.range(0, 10, 1, 1).write.parquet(path)
> spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
> def f(x):
> return 0
> fUdf = udf(f, LongType())
> spark.read.parquet(path).select(fUdf('id')).head()
> {code}
> This is because, the Python evaluation consumes the parent iterator in a 
> separate thread and it consumes more data from the parent even after the task 
> ends and the parent is closed. If an off-heap column vector exists in the 
> parent iterator, it could cause segmentation fault which crashes the executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224328#comment-17224328
 ] 

Apache Spark commented on SPARK-33277:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/30217

> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> 
>
> Key: SPARK-33277
> URL: https://issues.apache.org/jira/browse/SPARK-33277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Takuya Ueshin
>Priority: Major
> Fix For: 3.1.0
>
>
> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> E.g.,:
> {code:java}
> spark.range(0, 10, 1, 1).write.parquet(path)
> spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
> def f(x):
> return 0
> fUdf = udf(f, LongType())
> spark.read.parquet(path).select(fUdf('id')).head()
> {code}
> This is because, the Python evaluation consumes the parent iterator in a 
> separate thread and it consumes more data from the parent even after the task 
> ends and the parent is closed. If an off-heap column vector exists in the 
> parent iterator, it could cause segmentation fault which crashes the executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33304) Add from_avro and to_avro functions to SparkR

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224301#comment-17224301
 ] 

Apache Spark commented on SPARK-33304:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30216

> Add from_avro and to_avro functions to SparkR
> -
>
> Key: SPARK-33304
> URL: https://issues.apache.org/jira/browse/SPARK-33304
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> {{from_avro}} and {{to_avro}} have been added to Scala / Java / Python in 
> 3.0, but are still missing in R API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33304) Add from_avro and to_avro functions to SparkR

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33304:


Assignee: Apache Spark

> Add from_avro and to_avro functions to SparkR
> -
>
> Key: SPARK-33304
> URL: https://issues.apache.org/jira/browse/SPARK-33304
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> {{from_avro}} and {{to_avro}} have been added to Scala / Java / Python in 
> 3.0, but are still missing in R API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33304) Add from_avro and to_avro functions to SparkR

2020-11-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224299#comment-17224299
 ] 

Apache Spark commented on SPARK-33304:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/30216

> Add from_avro and to_avro functions to SparkR
> -
>
> Key: SPARK-33304
> URL: https://issues.apache.org/jira/browse/SPARK-33304
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> {{from_avro}} and {{to_avro}} have been added to Scala / Java / Python in 
> 3.0, but are still missing in R API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33304) Add from_avro and to_avro functions to SparkR

2020-11-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33304:


Assignee: (was: Apache Spark)

> Add from_avro and to_avro functions to SparkR
> -
>
> Key: SPARK-33304
> URL: https://issues.apache.org/jira/browse/SPARK-33304
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> {{from_avro}} and {{to_avro}} have been added to Scala / Java / Python in 
> 3.0, but are still missing in R API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2020-11-01 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-20044.

Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29820
[https://github.com/apache/spark/pull/29820]

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: Oliver Koeth
>Assignee: Gengliang Wang
>Priority: Minor
>  Labels: reverse-proxy, sso
> Fix For: 3.1.0
>
>
> Purpose: allow running the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows accessing multiple Spark clusters through the same virtual host, 
> distinguishing them only by context root, like www.mydomain.com/cluster1 and 
> www.mydomain.com/cluster2, and it allows running the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl.
> The problem can be partially worked around by rewriting content in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaptation are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface.
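
For context, here is a minimal sketch of the kind of configuration this issue 
targets, assuming a cluster reverse-proxied under a path prefix (the URL reuses 
the examples above; these settings normally live in spark-defaults.conf on the 
standalone master, and the builder form below is only for illustration):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ui-behind-reverse-proxy")
  // Route worker and application UI requests through the master UI (SPARK-15487).
  .config("spark.ui.reverseProxy", "true")
  // Public URL of the front-end proxy, including the context root that this
  // issue asks the generated UI links to honor.
  .config("spark.ui.reverseProxyUrl", "https://www.mydomain.com/cluster1")
  .getOrCreate()
{code}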



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33277.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/30177

> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> 
>
> Key: SPARK-33277
> URL: https://issues.apache.org/jira/browse/SPARK-33277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Takuya Ueshin
>Priority: Major
> Fix For: 3.1.0
>
>
> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> E.g.,:
> {code:java}
> spark.range(0, 10, 1, 1).write.parquet(path)
> spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
> def f(x):
> return 0
> fUdf = udf(f, LongType())
> spark.read.parquet(path).select(fUdf('id')).head()
> {code}
> This is because, the Python evaluation consumes the parent iterator in a 
> separate thread and it consumes more data from the parent even after the task 
> ends and the parent is closed. If an off-heap column vector exists in the 
> parent iterator, it could cause segmentation fault which crashes the executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33310) Relax pyspark typing for sql str functions

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33310:


Assignee: Daniel Himmelstein

> Relax pyspark typing for sql str functions
> --
>
> Key: SPARK-33310
> URL: https://issues.apache.org/jira/browse/SPARK-33310
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Daniel Himmelstein
>Assignee: Daniel Himmelstein
>Priority: Minor
>  Labels: pyspark.sql.functions, type
> Fix For: 3.1.0
>
>
> Several pyspark.sql.functions have overly strict typing, in that the type is 
> more restrictive than the functionality. Specifically, the function allows 
> specifying the column to operate on with a pyspark.sql.Column or a str. This 
> is handled internally by 
> [_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
>  which accepts a Column or string.
> There is a pre-existing type for this: 
> [ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37].
>  ColumnOrName is used for many of the type definitions of 
> pyspark.sql.functions arguments, but [not 
> for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
>  locate, lpad, rpad, repeat, and split.
> {code:java}
> def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
> def lpad(col: Column, len: int, pad: str) -> Column: ...
> def rpad(col: Column, len: int, pad: str) -> Column: ...
> def repeat(col: Column, n: int) -> Column: ...
> def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code}
> ColumnOrName was not added by [~zero323] since Maciej "was concerned that 
> this might be confusing or ambiguous", because these functions take a column 
> to operate on as well as strings that are used in the operation.
> But I think ColumnOrName makes it clear that this variable refers to the 
> column and not to a string parameter. There are also other ways to address 
> confusion, such as the docstring, or renaming the column argument from str 
> to col.
> Finally, there is considerable convenience for users in not having to wrap 
> column names in pyspark.sql.functions.col. Elsewhere the API seems quite 
> consistent in its willingness to accept columns by name rather than requiring 
> a Column object (at least when a string value has no alternative meaning; 
> .when/.otherwise would be an exception).
> For example, we were calling pyspark.sql.functions.split with a string value 
> for the str argument (specifying which column to split). And I noticed this 
> when we enforced typing with pyspark-stubs in preparation for pyspark 3.1. 
> For users that will enable typing in 3.1, this is a restriction in 
> functionality.
> Pre-existing PRs to address this:
>  * [https://github.com/apache/spark/pull/30209]
>  * [https://github.com/zero323/pyspark-stubs/pull/420]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33310) Relax pyspark typing for sql str functions

2020-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33310.
--
Resolution: Fixed

Issue resolved by pull request 30209
[https://github.com/apache/spark/pull/30209]

> Relax pyspark typing for sql str functions
> --
>
> Key: SPARK-33310
> URL: https://issues.apache.org/jira/browse/SPARK-33310
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Daniel Himmelstein
>Assignee: Daniel Himmelstein
>Priority: Minor
>  Labels: pyspark.sql.functions, type
> Fix For: 3.1.0
>
>
> Several pyspark.sql.functions have overly strict typing, in that the type is 
> more restrictive than the functionality. Specifically, the function allows 
> specifying the column to operate on with a pyspark.sql.Column or a str. This 
> is handled internally by 
> [_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
>  which accepts a Column or string.
> There is a pre-existing type for this: 
> [ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37].
>  ColumnOrName is used for many of the type definitions of 
> pyspark.sql.functions arguments, but [not 
> for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
>  locate, lpad, rpad, repeat, and split.
> {code:java}
> def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
> def lpad(col: Column, len: int, pad: str) -> Column: ...
> def rpad(col: Column, len: int, pad: str) -> Column: ...
> def repeat(col: Column, n: int) -> Column: ...
> def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code}
> ColumnOrName was not added by [~zero323] since Maciej "was concerned that 
> this might be confusing or ambiguous", because these functions take a column 
> to operate on as well as strings that are used in the operation.
> But I think ColumnOrName makes it clear that this variable refers to the 
> column and not to a string parameter. There are also other ways to address 
> confusion, such as the docstring, or renaming the column argument from str 
> to col.
> Finally, there is considerable convenience for users in not having to wrap 
> column names in pyspark.sql.functions.col. Elsewhere the API seems quite 
> consistent in its willingness to accept columns by name rather than requiring 
> a Column object (at least when a string value has no alternative meaning; 
> .when/.otherwise would be an exception).
> For example, we were calling pyspark.sql.functions.split with a string value 
> for the str argument (specifying which column to split). And I noticed this 
> when we enforced typing with pyspark-stubs in preparation for pyspark 3.1. 
> For users that will enable typing in 3.1, this is a restriction in 
> functionality.
> Pre-existing PRs to address this:
>  * [https://github.com/apache/spark/pull/30209]
>  * [https://github.com/zero323/pyspark-stubs/pull/420]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33312) Provide latest Spark 2.4.7 runnable distribution

2020-11-01 Thread Prateek Dubey (Jira)
Prateek Dubey created SPARK-33312:
-

 Summary: Provide latest Spark 2.4.7 runnable distribution
 Key: SPARK-33312
 URL: https://issues.apache.org/jira/browse/SPARK-33312
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 2.4.7
Reporter: Prateek Dubey


Not sure if this is the right approach, but it would be great if the latest 
Spark 2.4.7 runnable distribution could be provided here: 
[https://spark.apache.org/downloads.html]

Currently it seems the last build was done on September 12th, 2020.

I'm working on running Spark workloads on EKS using EKS IRSA. I'm able to run 
Spark workloads on EKS using IRSA with Spark 3.0/Hadoop 3.2, but I want to do 
the same with Spark 2.4.7/Hadoop 2.7.

Recently this PR was merged into 2.4.x - 
[https://github.com/apache/spark/pull/29877] - and therefore I need the latest 
Spark distribution.

PS: I also tried building the latest Spark 2.4.7 myself using Maven, but it 
fails with errors every time it reaches the R build, so it would be great if 
the Spark community could provide the latest build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org