[jira] [Assigned] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.
[ https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27036: Assignee: (was: Apache Spark) > Even Broadcast thread is timed out, BroadCast Job is not aborted. > - > > Key: SPARK-27036 > URL: https://issues.apache.org/jira/browse/SPARK-27036 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.2 > Reporter: Babulal > Priority: Minor > Attachments: image-2019-03-04-00-38-52-401.png, image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png > > > While a broadcast-table job is executing, if the broadcast timeout (spark.sql.broadcastTimeout) expires, the broadcast job still runs to completion, whereas it should be aborted on the timeout. An exception is thrown in the console, but the Spark job continues. > > !image-2019-03-04-00-39-38-779.png! > !image-2019-03-04-00-39-12-210.png! > > wait for some time > !image-2019-03-04-00-38-52-401.png! > !image-2019-03-04-00-34-47-884.png!
> 
> How to Reproduce the Issue
> Option 1, using SQL:
> Create table t1 (the big table, 1M records):
> {code:java}
> val rdd1 = spark.sparkContext.parallelize(1 to 100, 100).map(x => ("name_" + x, x % 3, x))
> val df = rdd1.toDF.selectExpr("_1 as name", "_2 as age", "_3 as sal", "_1 as c1", "_1 as c2", "_1 as c3", "_1 as c4", "_1 as c5", "_1 as c6", "_1 as c7", "_1 as c8", "_1 as c9", "_1 as c10", "_1 as c11", "_1 as c12", "_1 as c13", "_1 as c14", "_1 as c15", "_1 as c16", "_1 as c17", "_1 as c18", "_1 as c19", "_1 as c20", "_1 as c21", "_1 as c22", "_1 as c23", "_1 as c24", "_1 as c25", "_1 as c26", "_1 as c27", "_1 as c28", "_1 as c29", "_1 as c30")
> df.write.csv("D:/data/par1/t4")
> spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t4')")
> {code}
> Create table t2 (the small table, 100K records):
> {code:java}
> val rdd1 = spark.sparkContext.parallelize(1 to 10, 100).map(x => ("name_" + x, x % 3, x))
> val df = rdd1.toDF.selectExpr("_1 as name", "_2 as age", "_3 as sal", "_1 as c1", "_1 as c2", "_1 as c3", "_1 as c4", "_1 as c5", "_1 as c6", "_1 as c7", "_1 as c8", "_1 as c9", "_1 as c10", "_1 as c11", "_1 as c12", "_1 as c13", "_1 as c14", "_1 as c15", "_1 as c16", "_1 as c17", "_1 as c18", "_1 as c19", "_1 as c20", "_1 as c21", "_1 as c22", "_1 as c23", "_1 as c24", "_1 as c25", "_1 as c26", "_1 as c27", "_1 as c28", "_1 as c29", "_1 as c30")
> df.write.csv("D:/data/par1/t5")
> spark.sql("create table csv_1 using csv options('path'='D:/data/par1/t5')")
> {code}
> Set the configuration:
> {code:java}
> spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
> spark.sql("set spark.sql.broadcastTimeout=2").show(false)
> {code}
> Run the query below:
> {code:java}
> spark.sql("create table s using parquet as select t1.* from csv_2 as t1, csv_1 as t2 where t1._c3 = t2._c3")
> {code}
> Option 2: use an external DataSource, add a delay in #buildScan, and use that datasource for the query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
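The behavior reported above comes down to waiting on a broadcast future with a timeout but never cancelling the underlying job when the wait fails. A minimal sketch of the expected abort pattern, in plain Java rather than Spark's actual broadcast code (class and variable names are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BroadcastTimeoutSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Stand-in for the broadcast build job: a long-running, interruptible task.
        Future<String> broadcastJob = pool.submit(() -> {
            Thread.sleep(60_000); // simulates building the broadcast relation
            return "broadcast-data";
        });

        boolean aborted = false;
        try {
            // Wait at most 2 seconds, mirroring spark.sql.broadcastTimeout=2.
            broadcastJob.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // The behavior the ticket asks for: on timeout, actively cancel the
            // job instead of only surfacing an exception while it runs on.
            broadcastJob.cancel(true); // interrupts the worker thread
            aborted = true;
        }
        pool.shutdownNow();
        System.out.println(aborted && broadcastJob.isCancelled());
    }
}
```

The key design point is that `Future.get(timeout)` alone only abandons the wait; without the explicit `cancel(true)`, the submitted task keeps running, which mirrors the reported symptom.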
[jira] [Assigned] (SPARK-27036) Even Broadcast thread is timed out, BroadCast Job is not aborted.
[ https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27036: Assignee: Apache Spark > Even Broadcast thread is timed out, BroadCast Job is not aborted. > - > > Key: SPARK-27036 > URL: https://issues.apache.org/jira/browse/SPARK-27036 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.2 > Reporter: Babulal > Assignee: Apache Spark > Priority: Minor > Attachments: image-2019-03-04-00-38-52-401.png, image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png
[jira] [Created] (SPARK-27113) remove CHECK_FILES_EXIST_KEY option in file source
Wenchen Fan created SPARK-27113: --- Summary: remove CHECK_FILES_EXIST_KEY option in file source Key: SPARK-27113 URL: https://issues.apache.org/jira/browse/SPARK-27113 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan
[jira] [Resolved] (SPARK-27056) Remove `start-shuffle-service.sh`
[ https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27056. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23975 [https://github.com/apache/spark/pull/23975] > Remove `start-shuffle-service.sh` > -- > > Key: SPARK-27056 > URL: https://issues.apache.org/jira/browse/SPARK-27056 > Project: Spark > Issue Type: Improvement > Components: Mesos > Affects Versions: 3.0.0 > Reporter: liuxian > Assignee: liuxian > Priority: Minor > Fix For: 3.0.0 > > > _start-shuffle-service.sh_ was only used by Mesos before _start-mesos-shuffle-service.sh_ existed. _start-mesos-shuffle-service.sh_ solves several problems and is better than _start-shuffle-service.sh_, so we should now delete _start-shuffle-service.sh_ in case users are still using it.
[jira] [Assigned] (SPARK-27056) Remove `start-shuffle-service.sh`
[ https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27056: - Assignee: liuxian > Remove `start-shuffle-service.sh` > -- > > Key: SPARK-27056 > URL: https://issues.apache.org/jira/browse/SPARK-27056 > Project: Spark > Issue Type: Improvement > Components: Mesos > Affects Versions: 3.0.0 > Reporter: liuxian > Assignee: liuxian > Priority: Minor
[jira] [Updated] (SPARK-27101) Drop the created database after the test in test_session
[ https://issues.apache.org/jira/browse/SPARK-27101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27101: - Summary: Drop the created database after the test in test_session (was: clean up testcase in SparkSessionTests3) > Drop the created database after the test in test_session > > > Key: SPARK-27101 > URL: https://issues.apache.org/jira/browse/SPARK-27101 > Project: Spark > Issue Type: Test > Components: PySpark > Affects Versions: 2.4.0 > Reporter: Sandeep Katta > Priority: Trivial
[jira] [Resolved] (SPARK-27101) Drop the created database after the test in test_session
[ https://issues.apache.org/jira/browse/SPARK-27101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27101. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24021 [https://github.com/apache/spark/pull/24021] > Drop the created database after the test in test_session > > > Key: SPARK-27101 > URL: https://issues.apache.org/jira/browse/SPARK-27101 > Project: Spark > Issue Type: Test > Components: PySpark > Affects Versions: 2.4.0 > Reporter: Sandeep Katta > Assignee: Sandeep Katta > Priority: Trivial > Fix For: 3.0.0
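The fix above makes the PySpark test drop the database it created, regardless of how the test body exits. The general pattern is setup/teardown with a finally block; a minimal sketch in Java with an in-memory stand-in for the catalog (not PySpark's actual API; names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class CleanupSketch {
    // Stand-in for the shared catalog the test mutates.
    static Set<String> databases = new HashSet<>(Set.of("default"));

    static void testWithDatabase() {
        String db = "test_db";
        databases.add(db); // setup: create the database
        try {
            // the test body runs against the created database here
        } finally {
            databases.remove(db); // teardown: always drop it, even on failure
        }
    }

    public static void main(String[] args) {
        testWithDatabase();
        // Only the pre-existing database survives the test run.
        System.out.println(databases);
    }
}
```

Without the finally block, a failing test would leave `test_db` behind and leak state into later tests, which is exactly what the ticket title describes.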
[jira] [Assigned] (SPARK-27101) Drop the created database after the test in test_session
[ https://issues.apache.org/jira/browse/SPARK-27101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27101: Assignee: Sandeep Katta > Drop the created database after the test in test_session > > > Key: SPARK-27101 > URL: https://issues.apache.org/jira/browse/SPARK-27101 > Project: Spark > Issue Type: Test > Components: PySpark > Affects Versions: 2.4.0 > Reporter: Sandeep Katta > Assignee: Sandeep Katta > Priority: Trivial
[jira] [Assigned] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
[ https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27112: Assignee: Apache Spark > Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting > > Key: SPARK-27112 > URL: https://issues.apache.org/jira/browse/SPARK-27112 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core > Affects Versions: 2.4.0, 3.0.0 > Reporter: Parth Gandhi > Assignee: Apache Spark > Priority: Major > Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, Screen Shot 2019-02-26 at 4.11.26 PM.png > > Recently, a few Spark users in the organization have reported that their jobs were getting stuck. On further analysis, it was found that there are two independent deadlocks, each occurring under different circumstances. Screenshots of the two deadlocks are attached.
> We were able to reproduce the deadlocks with the following piece of code:
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark._
> import org.apache.spark.TaskContext
>
> // Simple example of Word Count in Scala
> object ScalaWordCount {
>   def main(args: Array[String]) {
>     if (args.length < 2) {
>       System.err.println("Usage: ScalaWordCount ")
>       System.exit(1)
>     }
>     val conf = new SparkConf().setAppName("Scala Word Count")
>     val sc = new SparkContext(conf)
>     // get the input file uri
>     val inputFilesUri = args(0)
>     // get the output file uri
>     val outputFilesUri = args(1)
>     while (true) {
>       val textFile = sc.textFile(inputFilesUri)
>       val counts = textFile.flatMap(line => line.split(" "))
>         .map(word => { if (TaskContext.get.partitionId == 5 && TaskContext.get.attemptNumber == 0) throw new Exception("Fail for blacklisting") else (word, 1) })
>         .reduceByKey(_ + _)
>       counts.saveAsTextFile(outputFilesUri)
>       val conf: Configuration = new Configuration()
>       val path: Path = new Path(outputFilesUri)
>       val hdfs: FileSystem = FileSystem.get(conf)
>       hdfs.delete(path, true)
>     }
>     sc.stop()
>   }
> }
> {code}
> Additionally, to ensure that the deadlock surfaces soon enough, I also added a small delay in the Spark code here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]
> {code:java}
> executorIdToFailureList.remove(exec)
> updateNextExpiryTime()
> Thread.sleep(2000)
> killBlacklistedExecutor(exec)
> {code}
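The hang described in this ticket is the classic lock-inversion pattern: one thread holds lock A and waits for lock B while another holds B and waits for A. The standard remedy is a fixed global acquisition order. A minimal sketch of that remedy, in plain Java with illustrative names (the two monitors stand in for the scheduler and backend state, not Spark's actual classes):

```java
import java.util.concurrent.CountDownLatch;

public class LockOrderSketch {
    // Stand-ins for the two monitors involved in the reported deadlocks.
    static final Object schedulerLock = new Object();
    static final Object backendLock = new Object();

    // Deadlock-free because BOTH paths take schedulerLock before backendLock.
    // The inversion would be: one path taking backendLock first while the
    // other takes schedulerLock first.
    static void killExecutor() {
        synchronized (schedulerLock) {
            synchronized (backendLock) { /* remove executor state */ }
        }
    }

    static void submitTasks() {
        synchronized (schedulerLock) {
            synchronized (backendLock) { /* offer resources */ }
        }
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch done = new CountDownLatch(2);
        Runnable[] jobs = { LockOrderSketch::killExecutor, LockOrderSketch::submitTasks };
        for (Runnable job : jobs) {
            new Thread(() -> {
                for (int i = 0; i < 10_000; i++) job.run();
                done.countDown();
            }).start();
        }
        done.await(); // can only complete if no interleaving deadlocks
        System.out.println("no deadlock");
    }
}
```

Reversing the nesting in exactly one of the two methods reintroduces the hazard, which is why a documented lock order (or moving blocking calls outside the critical section, as the `Thread.sleep` injection above makes visible) is the usual fix.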
[jira] [Assigned] (SPARK-27111) A continuous query may fail with InterruptedException when kafka consumer temporally 0 partitions temporally
[ https://issues.apache.org/jira/browse/SPARK-27111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27111: Assignee: Apache Spark (was: Shixiong Zhu) > A continuous query may fail with InterruptedException when kafka consumer temporally 0 partitions temporally > > > Key: SPARK-27111 > URL: https://issues.apache.org/jira/browse/SPARK-27111 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3 > Reporter: Shixiong Zhu > Assignee: Apache Spark > Priority: Major > > Before a Kafka consumer gets assigned partitions, its offset will contain 0 partitions. However, runContinuous will still run and launch a Spark job with 0 partitions. In this case, there is a race where the epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally.
[jira] [Assigned] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
[ https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27112: Assignee: (was: Apache Spark) > Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting > > Key: SPARK-27112 > URL: https://issues.apache.org/jira/browse/SPARK-27112 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core > Affects Versions: 2.4.0, 3.0.0 > Reporter: Parth Gandhi > Priority: Major
[jira] [Assigned] (SPARK-27111) A continuous query may fail with InterruptedException when kafka consumer temporally 0 partitions temporally
[ https://issues.apache.org/jira/browse/SPARK-27111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27111: Assignee: Shixiong Zhu (was: Apache Spark) > A continuous query may fail with InterruptedException when kafka consumer temporally 0 partitions temporally > > > Key: SPARK-27111 > URL: https://issues.apache.org/jira/browse/SPARK-27111 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3 > Reporter: Shixiong Zhu > Assignee: Shixiong Zhu > Priority: Major
[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
[ https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi updated SPARK-27112: - Attachment: Screen Shot 2019-02-26 at 4.10.48 PM.png
[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
[ https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi updated SPARK-27112: - Attachment: Screen Shot 2019-02-26 at 4.11.26 PM.png
[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
[ https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi updated SPARK-27112: - Attachment: Screen Shot 2019-02-26 at 4.11.11 PM.png
[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
[ https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Gandhi updated SPARK-27112: - Attachment: Screen Shot 2019-02-26 at 4.10.26 PM.png
[jira] [Created] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting
Parth Gandhi created SPARK-27112: Summary: Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting Key: SPARK-27112 URL: https://issues.apache.org/jira/browse/SPARK-27112 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 2.4.0, 3.0.0 Reporter: Parth Gandhi Recently, a few spark users in the organization have reported that their jobs were getting stuck. On further analysis, it was found out that there exist two independent deadlocks and either of them occur under different circumstances. The screenshots for these two deadlocks are attached here. We were able to reproduce the deadlocks with the following piece of code: {code:java} import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileSystem, Path} import org.apache.spark._ import org.apache.spark.TaskContext // Simple example of Word Count in Scala object ScalaWordCount { def main(args: Array[String]) { if (args.length < 2) { System.err.println("Usage: ScalaWordCount ") System.exit(1) } val conf = new SparkConf().setAppName("Scala Word Count") val sc = new SparkContext(conf) // get the input file uri val inputFilesUri = args(0) // get the output file uri val outputFilesUri = args(1) while (true) { val textFile = sc.textFile(inputFilesUri) val counts = textFile.flatMap(line => line.split(" ")) .map(word => {if (TaskContext.get.partitionId == 5 && TaskContext.get.attemptNumber == 0) throw new Exception("Fail for blacklisting") else (word, 1)}) .reduceByKey(_ + _) counts.saveAsTextFile(outputFilesUri) val conf: Configuration = new Configuration() val path: Path = new Path(outputFilesUri) val hdfs: FileSystem = FileSystem.get(conf) hdfs.delete(path, true) } sc.stop() } } {code} Additionally, to ensure that the deadlock surfaces up soon enough, I also added a small delay in the Spark code here: 
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256] {code:java} executorIdToFailureList.remove(exec) updateNextExpiryTime() Thread.sleep(2000) killBlacklistedExecutor(exec) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
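The deadlocks described above come down to lock ordering. As a hedged illustration (a standalone sketch with made-up names, not the actual Spark scheduler code), two threads that acquire the same pair of monitors in opposite orders can block each other forever:

```scala
// Hypothetical sketch of the deadlock shape: one path takes the scheduler
// lock then the allocation lock, the other takes them in reverse order.
// Run concurrently, each thread can end up holding its first lock while
// waiting forever for the other's.
object LockOrderSketch {
  private val taskSchedulerLock = new Object
  private val allocationLock = new Object

  // e.g. a blacklisting path: scheduler lock first, then allocation lock
  def killBlacklistedExecutorPath(): String = taskSchedulerLock.synchronized {
    allocationLock.synchronized { "executor removed" }
  }

  // e.g. a dynamic-allocation path: allocation lock first, then scheduler lock
  def dynamicAllocationPath(): String = allocationLock.synchronized {
    taskSchedulerLock.synchronized { "targets recomputed" }
  }
}
```

A `Thread.sleep` between the two acquisitions, like the one injected above in `BlacklistTracker`, widens the window in which both threads hold their first lock, which is why it makes the hang reproducible.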
[jira] [Created] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily has 0 partitions
Shixiong Zhu created SPARK-27111: Summary: A continuous query may fail with InterruptedException when the Kafka consumer temporarily has 0 partitions Key: SPARK-27111 URL: https://issues.apache.org/jira/browse/SPARK-27111 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.3, 2.3.2, 2.3.1, 2.3.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Before a Kafka consumer gets assigned partitions, its offset will contain 0 partitions. However, runContinuous will still run and launch a Spark job having 0 partitions. In this case, there is a race where epoch handling may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
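The race can be sketched outside Spark (hypothetical stand-in code, not the Structured Streaming internals): an interrupt issued after the intended blocking call has already returned simply lands on whatever blocking call the thread makes next.

```scala
// Sketch of an interrupt landing at an unintended point. The first sleep
// stands in for the intended interrupt target (e.g. the job run); the second
// stands in for a later call (e.g. a shutdown askSync) that gets hit instead.
object InterruptRaceSketch {
  def run(): String = {
    @volatile var result = "completed"
    val queryThread = new Thread(() => {
      try {
        Thread.sleep(10)   // the call the interrupter *meant* to cancel
        Thread.sleep(500)  // a later blocking call that absorbs the interrupt
      } catch {
        case _: InterruptedException => result = "interrupted at the wrong point"
      }
    })
    queryThread.start()
    Thread.sleep(100)      // the interrupter loses the race with the first call
    queryThread.interrupt()
    queryThread.join()
    result
  }
}
```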
[jira] [Assigned] (SPARK-27004) Code for https uri authentication in Spark Submit needs to be removed
[ https://issues.apache.org/jira/browse/SPARK-27004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27004: Assignee: (was: Apache Spark) > Code for https uri authentication in Spark Submit needs to be removed > - > > Key: SPARK-27004 > URL: https://issues.apache.org/jira/browse/SPARK-27004 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > The old code in Spark Submit used for URI verification, according to the > comments > [here|https://github.com/apache/spark/pull/23546#issuecomment-463340476] and > [here|https://github.com/apache/spark/pull/23546#issuecomment-463366075], > needs to be removed or refactored; otherwise it will cause failures with > secure HTTP URIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27004) Code for https uri authentication in Spark Submit needs to be removed
[ https://issues.apache.org/jira/browse/SPARK-27004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27004: Assignee: Apache Spark > Code for https uri authentication in Spark Submit needs to be removed > - > > Key: SPARK-27004 > URL: https://issues.apache.org/jira/browse/SPARK-27004 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Assignee: Apache Spark >Priority: Minor > > The old code in Spark Submit used for URI verification, according to the > comments > [here|https://github.com/apache/spark/pull/23546#issuecomment-463340476] and > [here|https://github.com/apache/spark/pull/23546#issuecomment-463366075], > needs to be removed or refactored; otherwise it will cause failures with > secure HTTP URIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27044) Maven dependency resolution does not support classifiers
[ https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27044. Resolution: Duplicate > Maven dependency resolution does not support classifiers > > > Key: SPARK-27044 > URL: https://issues.apache.org/jira/browse/SPARK-27044 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mathias Herberts >Priority: Major > > The spark-submit --packages option allows dependencies to be specified using > the Maven syntax group:artifact:version, but it does not support specifying a > classifier as group:artifact:version:classifier. > This makes it impossible to fetch 'uber' versions of packages, for example, as > they are usually specified using a classifier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code
[ https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788369#comment-16788369 ] Imran Rashid commented on SPARK-27097: -- I'm kind of amazed Spark works at all on different platforms. As you note, endianness probably cannot be different. What kind of platform difference results in this issue? Is it different versions of the JVM? I'd also be amazed if that worked properly. I'm not saying we shouldn't fix this if it's easy, but maybe we should clarify how different the "platform" can be between containers in a Spark app? > Avoid embedding platform-dependent offsets literally in whole-stage generated > code > -- > > Key: SPARK-27097 > URL: https://issues.apache.org/jira/browse/SPARK-27097 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Kris Mok >Priority: Critical > Labels: correctness > > Avoid embedding platform-dependent offsets literally in whole-stage generated > code. > Spark SQL performs whole-stage code generation to speed up query execution. > There are two steps to it: > Java source code is generated from the physical query plan on the driver. A > single version of the source code is generated from a query plan, and sent to > all executors. > It's compiled to bytecode on the driver to catch compilation errors before > sending to executors, but currently only the generated source code gets sent > to the executors. The bytecode compilation is for fail-fast only. > Executors receive the generated source code and compile to bytecode, then the > query runs like a hand-written Java program. > In this model, there's an implicit assumption about the driver and executors > being run on similar platforms. 
Some code paths accidentally embedded > platform-dependent object layout information into the generated code, such as: > {code:java} > Platform.putLong(buffer, /* offset */ 24, /* value */ 1); > {code} > This code expects a field to be at offset +24 of the buffer object, and sets > a value to that field. > But whole-stage code generation generally uses platform-dependent information > from the driver. If the object layout is significantly different on the > driver and executors, the generated code can be reading/writing to wrong > offsets on the executors, causing all kinds of data corruption. > One code pattern that leads to such a problem is the use of Platform.XXX > constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET. > Bad: > {code:java} > val baseOffset = Platform.BYTE_ARRAY_OFFSET > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > {code} > This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into > the generated code. > Good: > {code:java} > val baseOffset = "Platform.BYTE_ARRAY_OFFSET" > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > {code} > This will generate the offset symbolically -- Platform.putLong(buffer, > Platform.BYTE_ARRAY_OFFSET, value), which will be able to pick up the correct > value on the executors. > Caveat: these offset constants are declared as runtime-initialized static > final in Java, so they're not compile-time constants from the Java language's > perspective. It does lead to a slightly increased size of the generated code, > but this is necessary for correctness. > NOTE: there can be other patterns that generate platform-dependent code on > the driver which is invalid on the executors. e.g. if the endianness is > different between the driver and the executors, and if some generated code > makes strong assumptions about endianness, it would also be problematic. 
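The Bad/Good distinction above comes down to when the offset is evaluated. A minimal runnable sketch (the 16L value is a made-up stand-in for the real Platform.BYTE_ARRAY_OFFSET):

```scala
// Sketch of the two codegen templates from the description. The "bad"
// template evaluates the constant while building the source string (i.e. on
// the driver), baking its numeric value in; the "good" template splices the
// symbol, so each executor compiling the source resolves it locally.
object CodegenOffsetSketch {
  // stand-in for the runtime-initialized Platform.BYTE_ARRAY_OFFSET constant
  val BYTE_ARRAY_OFFSET: Long = 16L

  def badTemplate(buffer: String, value: String): String = {
    val baseOffset = BYTE_ARRAY_OFFSET // evaluated here, on the "driver"
    s"Platform.putLong($buffer, $baseOffset, $value);" // literal 16 baked in
  }

  def goodTemplate(buffer: String, value: String): String = {
    val baseOffset = "Platform.BYTE_ARRAY_OFFSET" // kept symbolic
    s"Platform.putLong($buffer, $baseOffset, $value);"
  }
}
```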
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse
[ https://issues.apache.org/jira/browse/SPARK-27110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788354#comment-16788354 ] Apache Spark commented on SPARK-27110: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/22204 > Moves some functions from AnalyzeColumnCommand to command/CommandUtils for > reuse > > > Key: SPARK-27110 > URL: https://issues.apache.org/jira/browse/SPARK-27110 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > To reuse some common logic for improving {{Analyze}} commands (see the > description of {{SPARK-25196}} for details), this ticket aims to move some > functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse
[ https://issues.apache.org/jira/browse/SPARK-27110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-27110: - Description: To reuse some common logic for improving {{Analyze}} commands (see the description of {{SPARK-25196}} for details), this ticket aims to move some functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. (was: To reuse some common logics for improving {{Analyze}} commands (See the description of {{SPARK-25196}}for details), this ticket targets to move some functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. ) > Moves some functions from AnalyzeColumnCommand to command/CommandUtils for > reuse > > > Key: SPARK-27110 > URL: https://issues.apache.org/jira/browse/SPARK-27110 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > To reuse some common logic for improving {{Analyze}} commands (see the > description of {{SPARK-25196}} for details), this ticket aims to move some > functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse
[ https://issues.apache.org/jira/browse/SPARK-27110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27110: Assignee: (was: Apache Spark) > Moves some functions from AnalyzeColumnCommand to command/CommandUtils for > reuse > > > Key: SPARK-27110 > URL: https://issues.apache.org/jira/browse/SPARK-27110 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > To reuse some common logic for improving {{Analyze}} commands (see the > description of {{SPARK-25196}} for details), this ticket aims to move some > functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse
[ https://issues.apache.org/jira/browse/SPARK-27110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27110: Assignee: Apache Spark > Moves some functions from AnalyzeColumnCommand to command/CommandUtils for > reuse > > > Key: SPARK-27110 > URL: https://issues.apache.org/jira/browse/SPARK-27110 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > To reuse some common logic for improving {{Analyze}} commands (see the > description of {{SPARK-25196}} for details), this ticket aims to move some > functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse
Takeshi Yamamuro created SPARK-27110: Summary: Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse Key: SPARK-27110 URL: https://issues.apache.org/jira/browse/SPARK-27110 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro To reuse some common logic for improving {{Analyze}} commands (see the description of {{SPARK-25196}} for details), this ticket aims to move some functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27109) Refactoring of TimestampFormatter and DateFormatter
[ https://issues.apache.org/jira/browse/SPARK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27109: Assignee: (was: Apache Spark) > Refactoring of TimestampFormatter and DateFormatter > --- > > Key: SPARK-27109 > URL: https://issues.apache.org/jira/browse/SPARK-27109 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > * Date/TimestampFormatter converts parsed input to Instant before converting > it to days/micros. This is an unnecessary conversion because the seconds and > fraction of a second can be extracted (calculated) from ZonedDateTime directly > * Avoid additional extraction of TemporalQueries.localTime from > temporalAccessor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code
[ https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27097: Assignee: Apache Spark (was: Kris Mok) > Avoid embedding platform-dependent offsets literally in whole-stage generated > code > -- > > Key: SPARK-27097 > URL: https://issues.apache.org/jira/browse/SPARK-27097 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > Labels: correctness > > Avoid embedding platform-dependent offsets literally in whole-stage generated > code. > Spark SQL performs whole-stage code generation to speed up query execution. > There are two steps to it: > Java source code is generated from the physical query plan on the driver. A > single version of the source code is generated from a query plan, and sent to > all executors. > It's compiled to bytecode on the driver to catch compilation errors before > sending to executors, but currently only the generated source code gets sent > to the executors. The bytecode compilation is for fail-fast only. > Executors receive the generated source code and compile to bytecode, then the > query runs like a hand-written Java program. > In this model, there's an implicit assumption about the driver and executors > being run on similar platforms. Some code paths accidentally embedded > platform-dependent object layout information into the generated code, such as: > {code:java} > Platform.putLong(buffer, /* offset */ 24, /* value */ 1); > {code} > This code expects a field to be at offset +24 of the buffer object, and sets > a value to that field. > But whole-stage code generation generally uses platform-dependent information > from the driver. If the object layout is significantly different on the > driver and executors, the generated code can be reading/writing to wrong > offsets on the executors, causing all kinds of data corruption. 
> One code pattern that leads to such a problem is the use of Platform.XXX > constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET. > Bad: > {code:java} > val baseOffset = Platform.BYTE_ARRAY_OFFSET > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > {code} > This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into > the generated code. > Good: > {code:java} > val baseOffset = "Platform.BYTE_ARRAY_OFFSET" > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > {code} > This will generate the offset symbolically -- Platform.putLong(buffer, > Platform.BYTE_ARRAY_OFFSET, value), which will be able to pick up the correct > value on the executors. > Caveat: these offset constants are declared as runtime-initialized static > final in Java, so they're not compile-time constants from the Java language's > perspective. It does lead to a slightly increased size of the generated code, > but this is necessary for correctness. > NOTE: there can be other patterns that generate platform-dependent code on > the driver which is invalid on the executors. e.g. if the endianness is > different between the driver and the executors, and if some generated code > makes strong assumptions about endianness, it would also be problematic. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code
[ https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27097: Assignee: Kris Mok (was: Apache Spark) > Avoid embedding platform-dependent offsets literally in whole-stage generated > code > -- > > Key: SPARK-27097 > URL: https://issues.apache.org/jira/browse/SPARK-27097 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Kris Mok >Priority: Critical > Labels: correctness > > Avoid embedding platform-dependent offsets literally in whole-stage generated > code. > Spark SQL performs whole-stage code generation to speed up query execution. > There are two steps to it: > Java source code is generated from the physical query plan on the driver. A > single version of the source code is generated from a query plan, and sent to > all executors. > It's compiled to bytecode on the driver to catch compilation errors before > sending to executors, but currently only the generated source code gets sent > to the executors. The bytecode compilation is for fail-fast only. > Executors receive the generated source code and compile to bytecode, then the > query runs like a hand-written Java program. > In this model, there's an implicit assumption about the driver and executors > being run on similar platforms. Some code paths accidentally embedded > platform-dependent object layout information into the generated code, such as: > {code:java} > Platform.putLong(buffer, /* offset */ 24, /* value */ 1); > {code} > This code expects a field to be at offset +24 of the buffer object, and sets > a value to that field. > But whole-stage code generation generally uses platform-dependent information > from the driver. If the object layout is significantly different on the > driver and executors, the generated code can be reading/writing to wrong > offsets on the executors, causing all kinds of data corruption. 
> One code pattern that leads to such a problem is the use of Platform.XXX > constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET. > Bad: > {code:java} > val baseOffset = Platform.BYTE_ARRAY_OFFSET > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > {code} > This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into > the generated code. > Good: > {code:java} > val baseOffset = "Platform.BYTE_ARRAY_OFFSET" > // codegen template: > s"Platform.putLong($buffer, $baseOffset, $value);" > {code} > This will generate the offset symbolically -- Platform.putLong(buffer, > Platform.BYTE_ARRAY_OFFSET, value), which will be able to pick up the correct > value on the executors. > Caveat: these offset constants are declared as runtime-initialized static > final in Java, so they're not compile-time constants from the Java language's > perspective. It does lead to a slightly increased size of the generated code, > but this is necessary for correctness. > NOTE: there can be other patterns that generate platform-dependent code on > the driver which is invalid on the executors. e.g. if the endianness is > different between the driver and the executors, and if some generated code > makes strong assumptions about endianness, it would also be problematic. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27109) Refactoring of TimestampFormatter and DateFormatter
[ https://issues.apache.org/jira/browse/SPARK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27109: Assignee: Apache Spark > Refactoring of TimestampFormatter and DateFormatter > --- > > Key: SPARK-27109 > URL: https://issues.apache.org/jira/browse/SPARK-27109 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > * Date/TimestampFormatter converts parsed input to Instant before converting > it to days/micros. This is an unnecessary conversion because the seconds and > fraction of a second can be extracted (calculated) from ZonedDateTime directly > * Avoid additional extraction of TemporalQueries.localTime from > temporalAccessor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27109) Refactoring of TimestampFormatter and DateFormatter
Maxim Gekk created SPARK-27109: -- Summary: Refactoring of TimestampFormatter and DateFormatter Key: SPARK-27109 URL: https://issues.apache.org/jira/browse/SPARK-27109 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk * Date/TimestampFormatter converts parsed input to Instant before converting it to days/micros. This is an unnecessary conversion because the seconds and fraction of a second can be extracted (calculated) from ZonedDateTime directly * Avoid additional extraction of TemporalQueries.localTime from temporalAccessor -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
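The first bullet can be illustrated with plain java.time (a sketch of the idea, not the actual Spark formatter code): the micros can be computed from the ZonedDateTime's own fields, skipping the intermediate Instant.

```scala
import java.time.ZonedDateTime

object MicrosSketch {
  private val MicrosPerSecond = 1000000L

  // current shape: go through an intermediate Instant
  def microsViaInstant(zdt: ZonedDateTime): Long = {
    val instant = zdt.toInstant
    instant.getEpochSecond * MicrosPerSecond + instant.getNano / 1000
  }

  // proposed shape: read the seconds and nanos off the ZonedDateTime directly
  def microsDirect(zdt: ZonedDateTime): Long =
    zdt.toEpochSecond * MicrosPerSecond + zdt.getNano / 1000
}
```

Both paths produce the same epoch micros; the direct one simply avoids allocating and re-deriving an Instant per parsed value.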
[jira] [Resolved] (SPARK-8547) xgboost exploration
[ https://issues.apache.org/jira/browse/SPARK-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8547. -- Resolution: Won't Fix I think the resolution is "xgboost4j-scala" > xgboost exploration > --- > > Key: SPARK-8547 > URL: https://issues.apache.org/jira/browse/SPARK-8547 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Major > > There has been quite a bit of excitement around xgboost: > [https://github.com/dmlc/xgboost] > It improves the parallelism of boosting by mixing boosting and bagging (where > bagging makes the algorithm more parallel). > It would be worth exploring implementing this within MLlib (probably as a new > algorithm). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9610) Class and instance weighting for ML
[ https://issues.apache.org/jira/browse/SPARK-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9610. -- Resolution: Done Except for GBTs, this is done, so I'm going to close the umbrella > Class and instance weighting for ML > --- > > Key: SPARK-9610 > URL: https://issues.apache.org/jira/browse/SPARK-9610 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley >Priority: Major > > This umbrella is for tracking tasks for adding support for label or instance > weights to ML algorithms. These additions will help handle skewed or > imbalanced data, ensemble methods, etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9478. -- Resolution: Duplicate > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw >Priority: Major > > Currently, this implementation of random forest does not support sample > (instance) weights. Weights are important when there is imbalanced training > data or the evaluation metric of a classifier is imbalanced (e.g. true > positive rate at some false positive threshold). Sample weights generalize > class weights, so this could be used to add class weights later on. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14599) BaggedPoint should support weighted instances.
[ https://issues.apache.org/jira/browse/SPARK-14599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14599. --- Resolution: Duplicate Weighted points were added with SPARK-19591 > BaggedPoint should support weighted instances. > -- > > Key: SPARK-14599 > URL: https://issues.apache.org/jira/browse/SPARK-14599 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Major > > This JIRA addresses a TODO in bagged point to support individual sample > weights. This is a blocker for > [SPARK-9478|https://issues.apache.org/jira/browse/SPARK-9478]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27108) Add parsed CreateTable plans to Catalyst
[ https://issues.apache.org/jira/browse/SPARK-27108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27108: Assignee: (was: Apache Spark) > Add parsed CreateTable plans to Catalyst > > > Key: SPARK-27108 > URL: https://issues.apache.org/jira/browse/SPARK-27108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Ryan Blue >Priority: Major > > The abstract Catalyst SQL AST builder cannot currently parse {{CREATE TABLE}} > commands. Creates are handled only by {{SparkSqlParser}} because the logical > plans are defined in the v1 datasource package > (org.apache.spark.sql.execution.datasources). > The {{SparkSqlParser}} mixes parsing with logic that is specific to v1, like > converting {{IF NOT EXISTS}} into a {{SaveMode}}. This makes it difficult > (and error-prone) to produce v2 plans because it requires converting the AST > to v1 and then converting v1 to v2. > Instead, the Catalyst parser should create plans that represent exactly what > was parsed, after validation like ensuring no duplicate clauses. Then those > plans should be converted to v1 or v2 plans in the analyzer. This structure > will avoid errors caused by multiple layers of translation and keeps v1 and > v2 plans separate to ensure that v1 has no behavior changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27108) Add parsed CreateTable plans to Catalyst
[ https://issues.apache.org/jira/browse/SPARK-27108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27108: Assignee: Apache Spark > Add parsed CreateTable plans to Catalyst > > > Key: SPARK-27108 > URL: https://issues.apache.org/jira/browse/SPARK-27108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > > The abstract Catalyst SQL AST builder cannot currently parse {{CREATE TABLE}} > commands. Creates are handled only by {{SparkSqlParser}} because the logical > plans are defined in the v1 datasource package > (org.apache.spark.sql.execution.datasources). > The {{SparkSqlParser}} mixes parsing with logic that is specific to v1, like > converting {{IF NOT EXISTS}} into a {{SaveMode}}. This makes it difficult > (and error-prone) to produce v2 plans because it requires converting the AST > to v1 and then converting v1 to v2. > Instead, the Catalyst parser should create plans that represent exactly what > was parsed, after validation like ensuring no duplicate clauses. Then those > plans should be converted to v1 or v2 plans in the analyzer. This structure > will avoid errors caused by multiple layers of translation and keeps v1 and > v2 plans separate to ensure that v1 has no behavior changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788206#comment-16788206 ] Eyal Farago commented on SPARK-17556: - Why was this abandoned? [~viirya]'s pull request seems promising. I think the last comment by [~LI,Xiao] applies to the current implementation as well, since executors hold the entire broadcast anyway (assuming they ran a task that used it) - so the memory footprint on the executor side doesn't change. Re. the performance regression in the case of multiple smaller partitions, this also applies to the current implementation, as the RDD partitions have to be calculated and transferred to the driver. One thing I personally think could be improved in [~viirya]'s PR was the requirement for the RDD to be pre-persisted; I think blocks could be evaluated in the mapPartitions operation performed in the newly introduced RDD.broadcast method, which would have addressed most of [~holdenk_amp]'s comments in the PR. > Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin >Priority: Major > Attachments: executor broadcast.pdf, executor-side-broadcast.pdf > > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27108) Add parsed CreateTable plans to Catalyst
Ryan Blue created SPARK-27108: - Summary: Add parsed CreateTable plans to Catalyst Key: SPARK-27108 URL: https://issues.apache.org/jira/browse/SPARK-27108 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: Ryan Blue The abstract Catalyst SQL AST builder cannot currently parse {{CREATE TABLE}} commands. Creates are handled only by {{SparkSqlParser}} because the logical plans are defined in the v1 datasource package (org.apache.spark.sql.execution.datasources). The {{SparkSqlParser}} mixes parsing with logic that is specific to v1, like converting {{IF NOT EXISTS}} into a {{SaveMode}}. This makes it difficult (and error-prone) to produce v2 plans because it requires converting the AST to v1 and then converting v1 to v2. Instead, the Catalyst parser should create plans that represent exactly what was parsed, after validation like ensuring no duplicate clauses. Then those plans should be converted to v1 or v2 plans in the analyzer. This structure will avoid errors caused by multiple layers of translation and keep v1 and v2 plans separate to ensure that v1 has no behavior changes.
[jira] [Resolved] (SPARK-27103) SparkSql reserved keywords aren't listed in alphabetical order
[ https://issues.apache.org/jira/browse/SPARK-27103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27103. --- Resolution: Fixed Assignee: SongYadong Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23985 > SparkSql reserved keywords aren't listed in alphabetical order > --- > > Key: SPARK-27103 > URL: https://issues.apache.org/jira/browse/SPARK-27103 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: SongYadong >Assignee: SongYadong >Priority: Minor > Fix For: 3.0.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > The current spark-sql reserved keywords are not listed in alphabetical order. > In the test suite, some repeated words need to be removed. It would also be > better to add some comments as reminders.
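The cleanup described above (alphabetical ordering plus removal of repeated keywords) amounts to a simple invariant check. The sketch below is illustrative only and is not Spark's actual test code; the object and method names are invented for the example.

```scala
// Illustrative invariant check for a reserved-keyword list: it should be
// alphabetically sorted and free of duplicates. Not Spark's actual test code.
object KeywordListCheck {
  def problems(keywords: Seq[String]): Seq[String] = {
    // Report each keyword that occurs more than once
    val duplicates = keywords.groupBy(identity).collect {
      case (kw, occurrences) if occurrences.size > 1 => s"duplicate keyword: $kw"
    }.toSeq
    // Report when the list is not in alphabetical order
    val ordering =
      if (keywords == keywords.sorted) Nil
      else Seq("keywords are not in alphabetical order")
    duplicates ++ ordering
  }
}
```

A test suite could then assert `problems(reservedKeywords).isEmpty`, so any regression names the offending keyword directly instead of failing on a raw list comparison.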
[jira] [Commented] (SPARK-27093) Honor ParseMode in AvroFileFormat
[ https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788177#comment-16788177 ] Tim Cerexhe commented on SPARK-27093: - Thanks for that idea [~Gengliang.Wang]. It catches some of the failure modes we need to protect against. However, we also have files that do not conform to the requisite schema (or have malformed schemas, e.g. with invalid byte sequences in keys, since they come from user uploads over faulty networks), and these exceptions aren't currently being squashed. If these failure modes were treated as "corrupt" files, then this would completely satisfy our needs (though this may be a stretch of the definition). I've uploaded our internal patch for your reference: https://github.com/apache/spark/pull/24027 > Honor ParseMode in AvroFileFormat > - > > Key: SPARK-27093 > URL: https://issues.apache.org/jira/browse/SPARK-27093 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Tim Cerexhe >Priority: Major > > The Avro reader is missing the ability to handle malformed or truncated files > like the JSON reader. Currently it throws exceptions when it encounters any > bad or truncated record in an Avro file, causing the entire Spark job to fail > from a single dodgy file. > Ideally the AvroFileFormat would accept a Permissive or DropMalformed > ParseMode like Spark's JSON format. This would enable the Avro reader to > drop bad records and continue processing the good records rather than > aborting the entire job. > Obviously the default could remain as FailFastMode, which is the current > effective behavior, so this wouldn’t break any existing users.
[jira] [Assigned] (SPARK-27093) Honor ParseMode in AvroFileFormat
[ https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27093: Assignee: (was: Apache Spark) > Honor ParseMode in AvroFileFormat > - > > Key: SPARK-27093 > URL: https://issues.apache.org/jira/browse/SPARK-27093 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Tim Cerexhe >Priority: Major > > The Avro reader is missing the ability to handle malformed or truncated files > like the JSON reader. Currently it throws exceptions when it encounters any > bad or truncated record in an Avro file, causing the entire Spark job to fail > from a single dodgy file. > Ideally the AvroFileFormat would accept a Permissive or DropMalformed > ParseMode like Spark's JSON format. This would enable the Avro reader to > drop bad records and continue processing the good records rather than > aborting the entire job. > Obviously the default could remain as FailFastMode, which is the current > effective behavior, so this wouldn’t break any existing users.
[jira] [Assigned] (SPARK-27093) Honor ParseMode in AvroFileFormat
[ https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27093: Assignee: Apache Spark > Honor ParseMode in AvroFileFormat > - > > Key: SPARK-27093 > URL: https://issues.apache.org/jira/browse/SPARK-27093 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Tim Cerexhe >Assignee: Apache Spark >Priority: Major > > The Avro reader is missing the ability to handle malformed or truncated files > like the JSON reader. Currently it throws exceptions when it encounters any > bad or truncated record in an Avro file, causing the entire Spark job to fail > from a single dodgy file. > Ideally the AvroFileFormat would accept a Permissive or DropMalformed > ParseMode like Spark's JSON format. This would enable the Avro reader to > drop bad records and continue processing the good records rather than > aborting the entire job. > Obviously the default could remain as FailFastMode, which is the current > effective behavior, so this wouldn’t break any existing users.
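For reference, the three parse modes borrowed from Spark's JSON/CSV readers behave roughly as follows. The sketch below is a conceptual pure-Scala model of those semantics, with invented names and types; it is not Spark's implementation.

```scala
import scala.util.{Failure, Success, Try}

// Conceptual model of Spark's three parse modes, applied to a decoder that
// may throw on a malformed record. This is NOT Spark's implementation.
object ParseModeSketch {
  sealed trait ParseMode
  case object Permissive    extends ParseMode // keep a placeholder (null row) for bad records
  case object DropMalformed extends ParseMode // silently drop bad records
  case object FailFast      extends ParseMode // abort on the first bad record

  def read[A](records: Seq[String], mode: ParseMode)(decode: String => A): Seq[Option[A]] =
    records.flatMap { record =>
      Try(decode(record)) match {
        case Success(value) => Some(Some(value))
        case Failure(cause) => mode match {
          case Permissive    => Some(None) // analogous to a row of nulls
          case DropMalformed => None       // record is skipped entirely
          case FailFast      => throw new RuntimeException(s"Malformed record: $record", cause)
        }
      }
    }
}
```

Under this model, the requested Avro change would route decode failures through the same three branches instead of always rethrowing, which is today's effective FailFast behavior.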
[jira] [Assigned] (SPARK-27079) Fix typo & Remove useless imports
[ https://issues.apache.org/jira/browse/SPARK-27079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27079: - Assignee: EdisonWang > Fix typo & Remove useless imports > - > > Key: SPARK-27079 > URL: https://issues.apache.org/jira/browse/SPARK-27079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Assignee: EdisonWang >Priority: Trivial >
[jira] [Resolved] (SPARK-27079) Fix typo & Remove useless imports
[ https://issues.apache.org/jira/browse/SPARK-27079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27079. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24000 [https://github.com/apache/spark/pull/24000] > Fix typo & Remove useless imports > - > > Key: SPARK-27079 > URL: https://issues.apache.org/jira/browse/SPARK-27079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Assignee: EdisonWang >Priority: Trivial > Fix For: 3.0.0 > >
[jira] [Updated] (SPARK-27079) Fix typo & Remove useless imports
[ https://issues.apache.org/jira/browse/SPARK-27079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27079: -- Priority: Trivial (was: Minor) Yeah [~EdisonWang], this isn't meaningful as a JIRA. If it's truly minor, in that describing the minor issue you're addressing is about the same as just making the PR, just skip the JIRA. > Fix typo & Remove useless imports > - > > Key: SPARK-27079 > URL: https://issues.apache.org/jira/browse/SPARK-27079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Trivial >
[jira] [Assigned] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("")
[ https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27090: Assignee: (was: Apache Spark) > Removing old LEGACY_DRIVER_IDENTIFIER ("") > -- > > Key: SPARK-27090 > URL: https://issues.apache.org/jira/browse/SPARK-27090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For legacy reasons, LEGACY_DRIVER_IDENTIFIER was checked in a few places > along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver > or an executor is running. > The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so > I think we have a chance to get rid of LEGACY_DRIVER_IDENTIFIER.
[jira] [Assigned] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("")
[ https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27090: Assignee: Apache Spark > Removing old LEGACY_DRIVER_IDENTIFIER ("") > -- > > Key: SPARK-27090 > URL: https://issues.apache.org/jira/browse/SPARK-27090 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Apache Spark >Priority: Major > > For legacy reasons, LEGACY_DRIVER_IDENTIFIER was checked in a few places > along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver > or an executor is running. > The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so > I think we have a chance to get rid of LEGACY_DRIVER_IDENTIFIER.
[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788075#comment-16788075 ] Hien Luu commented on SPARK-27027: -- Hi [~hyukjin.kwon], yes I did and here is how I started the spark-shell on my Mac. ./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0 My spark version is: scala> spark.version res0: String = 2.4.0 > from_avro function does not deserialize the Avro record of a struct column > type correctly > - > > Key: SPARK-27027 > URL: https://issues.apache.org/jira/browse/SPARK-27027 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Hien Luu >Priority: Minor > > {{from_avro}} function produces wrong output of a struct field. See the > output at the bottom of the description > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.avro._ > import org.apache.spark.sql.functions._ > spark.version > val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", > 50)).toDF("id", "name", "age") > val dfStruct = df.withColumn("value", struct("name","age")) > dfStruct.show > dfStruct.printSchema > val dfKV = dfStruct.select(to_avro('id).as("key"), > to_avro('value).as("value")) > val expectedSchema = StructType(Seq(StructField("name", StringType, > true),StructField("age", IntegerType, false))) > val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString > val avroTypeStr = s""" > |{ > | "type": "int", > | "name": "key" > |} > """.stripMargin > dfKV.select(from_avro('key, avroTypeStr)).show > dfKV.select(from_avro('value, avroTypeStruct)).show > // output for the last statement and that is not correct > +-+ > |from_avro(value, struct)| > +-+ > | [Josh Duke, 50]| > | [Josh Duke, 50]| > | [Josh Duke, 50]| > +-+ > {code}
[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
[ https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788021#comment-16788021 ] Dhruve Ashar commented on SPARK-27107: -- [~dongjoon] can you review the PR for ORC to fix this issue? [https://github.com/apache/orc/pull/372] Once this is merged, we can fix the issue in spark as well. Until then the only workaround is to use the hive based implementation. > Spark SQL Job failing because of Kryo buffer overflow with ORC > -- > > Key: SPARK-27107 > URL: https://issues.apache.org/jira/browse/SPARK-27107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > The issue occurs while trying to read ORC data and setting the SearchArgument. > {code:java} > Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. > Available: 0, required: 9 > Serialization trace: > literalList > (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) > leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) > at com.esotericsoftware.kryo.io.Output.require(Output.java:163) > at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) > at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) > at > com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) > at > com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) > at > com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) > at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) > at > org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) > at > org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) > at > org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) > at > org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) >
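Until the ORC-side fix is merged, the hive-based fallback mentioned above is a configuration change. A sketch, assuming Spark 2.3+/2.4 where `spark.sql.orc.impl` selects the ORC reader implementation (the stack trace above goes through the native `org.apache.spark.sql.execution.datasources.orc` path):

```shell
# Fall back to the Hive-based ORC reader, avoiding the native reader's
# Kryo-serialized SearchArgument path that overflows here
spark-shell --conf spark.sql.orc.impl=hive

# The same setting can also be applied inside an existing session:
#   spark.sql("SET spark.sql.orc.impl=hive")
```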
[jira] [Created] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC
Dhruve Ashar created SPARK-27107: Summary: Spark SQL Job failing because of Kryo buffer overflow with ORC Key: SPARK-27107 URL: https://issues.apache.org/jira/browse/SPARK-27107 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 2.3.2 Reporter: Dhruve Ashar The issue occurs while trying to read ORC data and setting the SearchArgument. {code:java} Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 9 Serialization trace: literalList (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl) leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl) at com.esotericsoftware.kryo.io.Output.require(Output.java:163) at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614) at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538) at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147) at com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628) at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100) at com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552) at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80) at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518) at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534) at org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96) at org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156) at scala.Option.foreach(Option.scala:257) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156) at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297) at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295) at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315) at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121) at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
[jira] [Commented] (SPARK-25928) NoSuchMethodError net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
[ https://issues.apache.org/jira/browse/SPARK-25928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787996#comment-16787996 ] Abhinay commented on SPARK-25928: - We are running into the same issue with EMR 5.20.0 with Oozie 5.0.0 and Spark 2.4.0. The workaround seems to be working for now, but it does require some additional setup: either converting to Snappy or deleting the lz4 1.4.0 jar in the Oozie share lib. The feedback from the AWS support team was that it is more of an upstream issue with Spark. Can we know when this issue will be resolved? > NoSuchMethodError > net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V > - > > Key: SPARK-25928 > URL: https://issues.apache.org/jira/browse/SPARK-25928 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.1 > Environment: EMR 5.17 which is using oozie 5.0.0 and spark 2.3.1 >Reporter: Jerry Chabot >Priority: Major > > I am not sure if this is an Oozie problem, a Spark problem or a user error. > It is blocking our upcoming release. > We are upgrading from Amazon's EMR 5.7 to EMR 5.17. The version changes are: > oozie 4.3.0 -> 5.0.0 > spark 2.1.1 -> 2.3.1 > All our Oozie/Spark jobs were working in EMR 5.7. After upgrading, some of > our jobs which use a spark action are failing with the NoSuchMethodError as > shown further in the description. It seems like conflicting classes. > I noticed the spark share lib directory has two versions of the LZ4 jar. > sudo -u hdfs hadoop fs -ls /user/oozie/share/lib/lib_20181029182704/spark/*lz* > -rw-r--r-- 3 oozie oozie 79845 2018-10-29 18:27 > /user/oozie/share/lib/lib_20181029182704/spark/compress-lzf-1.0.3.jar > -rw-r--r-- 3 hdfs oozie 236880 2018-11-01 18:22 > /user/oozie/share/lib/lib_20181029182704/spark/lz4-1.3.0.jar > -rw-r--r-- 3 oozie oozie 370119 2018-10-29 18:27 > /user/oozie/share/lib/lib_20181029182704/spark/lz4-java-1.4.0.jar > But, both of these jars have the constructor > LZ4BlockInputStream(java/io/InputStream). 
The spark/jars directory has only > lz4-java-1.4.0.jar. The share lib seems to be getting it from the > /usr/lib/oozie/oozie-sharelib.tar.gz. > Unfortunately, my team member who knows the most about Spark is on vacation. > Does anyone have any suggestions on how best to troubleshoot this problem? > Here is the stack trace. > diagnostics: User class threw exception: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 > (TID 3, ip-172-27-113-49.ec2.internal, executor 2): > java.lang.NoSuchMethodError: > net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V > at > org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:304) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1346) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)
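The workaround discussed in this thread (removing the conflicting lz4 jar from the Oozie Spark sharelib) might look like the following. The paths come from the report above; which jar must go depends on the lz4 version your Spark build bundles, so verify before deleting:

```shell
# List the lz4 jars in the Oozie spark sharelib (path from the report above)
sudo -u hdfs hadoop fs -ls '/user/oozie/share/lib/lib_20181029182704/spark/*lz*'

# Spark 2.3+ bundles lz4-java 1.4.x; the older lz4-1.3.0.jar lacks the
# LZ4BlockInputStream(InputStream, boolean) constructor, so remove it
# (verify which version your Spark build expects before deleting)
sudo -u hdfs hadoop fs -rm \
  /user/oozie/share/lib/lib_20181029182704/spark/lz4-1.3.0.jar

# Tell Oozie to refresh its view of the sharelib
oozie admin -sharelibupdate
```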
[jira] [Assigned] (SPARK-27106) merge CaseInsensitiveStringMap and DataSourceOptions
[ https://issues.apache.org/jira/browse/SPARK-27106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27106: Assignee: Wenchen Fan (was: Apache Spark) > merge CaseInsensitiveStringMap and DataSourceOptions > > > Key: SPARK-27106 > URL: https://issues.apache.org/jira/browse/SPARK-27106 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Commented] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala
[ https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787989#comment-16787989 ] Val Feldsher commented on SPARK-25863: -- [~maropu] Yes, it works now; I still see the warning, but it doesn't crash. Thanks!
> java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, Spark Core
> Affects Versions: 2.3.1, 2.3.2
> Reporter: Ruslan Dautkhanov
> Assignee: Takeshi Yamamuro
> Priority: Major
> Labels: cache, catalyst, code-generation
> Fix For: 2.3.4, 2.4.2, 3.0.0
>
> Failing task:
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 21413.0 (TID 4057314, pc1udatahad117, executor 431):
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at
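For readers of the trace above, the failure class is simply taking the max of an empty collection of compilation statistics. A minimal Python analog (illustrative only; Spark's actual fix lives in the Scala CodeGenerator) of the crash and the usual guard:

```python
def max_compiled_size(sizes):
    # Scala's Seq.max, like Python's max(), fails on an empty collection
    # (UnsupportedOperationException there, ValueError here); the guard is
    # to handle the empty case explicitly via a default.
    return max(sizes, default=0)

# The unguarded form reproduces the "empty.max" class of failure:
try:
    max([])
    raised = False
except ValueError:
    raised = True
```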
[jira] [Assigned] (SPARK-27106) merge CaseInsensitiveStringMap and DataSourceOptions
[ https://issues.apache.org/jira/browse/SPARK-27106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27106: Assignee: Apache Spark (was: Wenchen Fan) > merge CaseInsensitiveStringMap and DataSourceOptions > > > Key: SPARK-27106 > URL: https://issues.apache.org/jira/browse/SPARK-27106 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27106) merge CaseInsensitiveStringMap and DataSourceOptions
Wenchen Fan created SPARK-27106: --- Summary: merge CaseInsensitiveStringMap and DataSourceOptions Key: SPARK-27106 URL: https://issues.apache.org/jira/browse/SPARK-27106 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables
[ https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Kramer resolved SPARK-23012. - Resolution: Fixed
> Support for predicate pushdown and partition pruning when left joining large Hive tables
>
> Key: SPARK-23012
> URL: https://issues.apache.org/jira/browse/SPARK-23012
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer
> Affects Versions: 2.2.0
> Reporter: Rick Kramer
> Priority: Major
> Fix For: 2.4.0
>
> We have a hive view which left outer joins several large, partitioned orc hive tables together on date. When the view is used in a hive query, hive pushes date predicates down into the joins and prunes the partitions for all tables. When I use this view from pyspark, the predicate is only used to prune the left-most table and all partitions from the additional tables are selected.
> For example, consider two partitioned hive tables a & b joined in a view:
> create table a (
>   a_val string
> )
> partitioned by (ds string)
> stored as orc;
> create table b (
>   b_val string
> )
> partitioned by (ds string)
> stored as orc;
> create view example_view as
> select
>   a_val
> , b_val
> , ds
> from a
> left outer join b on b.ds = a.ds
> Then in pyspark you might try to query the view, filtering on ds:
> spark.table('example_view').filter(F.col('ds') == '2018-01-01')
> If tables a and b are large, this results in a plan that filters a on ds = '2018-01-01' but scans all partitions of table b.
> If the join in the view is changed to an inner join, the predicate gets pushed down to a & b and the partitions are pruned as you'd expect.
> In practice, the view is fairly complex and contains a lot of business logic we'd prefer not to replicate in pyspark if we can avoid it.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables
[ https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Kramer updated SPARK-23012: Fix Version/s: 2.4.0
> Support for predicate pushdown and partition pruning when left joining large Hive tables
>
> Key: SPARK-23012
> URL: https://issues.apache.org/jira/browse/SPARK-23012
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer
> Affects Versions: 2.2.0
> Reporter: Rick Kramer
> Priority: Major
> Fix For: 2.4.0
>
> We have a hive view which left outer joins several large, partitioned orc hive tables together on date. When the view is used in a hive query, hive pushes date predicates down into the joins and prunes the partitions for all tables. When I use this view from pyspark, the predicate is only used to prune the left-most table and all partitions from the additional tables are selected.
> For example, consider two partitioned hive tables a & b joined in a view:
> create table a (
>   a_val string
> )
> partitioned by (ds string)
> stored as orc;
> create table b (
>   b_val string
> )
> partitioned by (ds string)
> stored as orc;
> create view example_view as
> select
>   a_val
> , b_val
> , ds
> from a
> left outer join b on b.ds = a.ds
> Then in pyspark you might try to query the view, filtering on ds:
> spark.table('example_view').filter(F.col('ds') == '2018-01-01')
> If tables a and b are large, this results in a plan that filters a on ds = '2018-01-01' but scans all partitions of table b.
> If the join in the view is changed to an inner join, the predicate gets pushed down to a & b and the partitions are pruned as you'd expect.
> In practice, the view is fairly complex and contains a lot of business logic we'd prefer not to replicate in pyspark if we can avoid it.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables
[ https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787968#comment-16787968 ] Rick Kramer commented on SPARK-23012: - [~Saurabh Santhosh] We worked around it by materializing the hive view as a cached table and using the table from spark. I did retest this when 2.3.0 was released and I'm pretty sure the issue was still there. Good to know that 2.4 fixes it.
> Support for predicate pushdown and partition pruning when left joining large Hive tables
>
> Key: SPARK-23012
> URL: https://issues.apache.org/jira/browse/SPARK-23012
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer
> Affects Versions: 2.2.0
> Reporter: Rick Kramer
> Priority: Major
>
> We have a hive view which left outer joins several large, partitioned orc hive tables together on date. When the view is used in a hive query, hive pushes date predicates down into the joins and prunes the partitions for all tables. When I use this view from pyspark, the predicate is only used to prune the left-most table and all partitions from the additional tables are selected.
> For example, consider two partitioned hive tables a & b joined in a view:
> create table a (
>   a_val string
> )
> partitioned by (ds string)
> stored as orc;
> create table b (
>   b_val string
> )
> partitioned by (ds string)
> stored as orc;
> create view example_view as
> select
>   a_val
> , b_val
> , ds
> from a
> left outer join b on b.ds = a.ds
> Then in pyspark you might try to query the view, filtering on ds:
> spark.table('example_view').filter(F.col('ds') == '2018-01-01')
> If tables a and b are large, this results in a plan that filters a on ds = '2018-01-01' but scans all partitions of table b.
> If the join in the view is changed to an inner join, the predicate gets pushed down to a & b and the partitions are pruned as you'd expect.
> In practice, the view is fairly complex and contains a lot of business logic we'd prefer not to replicate in pyspark if we can avoid it.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
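A toy model (plain Python with hypothetical helper names, not Spark code) of why pushing the ds predicate into both sides of the left outer join is safe in the case above: the filter is on the join key itself, so pre-filtering b on ds cannot change the result.

```python
def left_join_on_ds(a_rows, b_rows):
    """Toy left outer join of (ds, a_val) rows with (ds, b_val) rows on ds."""
    out = []
    for ds, a_val in a_rows:
        matched = [b_val for b_ds, b_val in b_rows if b_ds == ds]
        if matched:
            out.extend((a_val, b_val, ds) for b_val in matched)
        else:
            out.append((a_val, None, ds))  # null-extended row keeps a.ds
    return out

a = [("2018-01-01", "a1"), ("2018-01-02", "a2")]
b = [("2018-01-01", "b1"), ("2018-01-02", "b2")]
target = "2018-01-01"

# Post-join filter: what the plan computes when b's partitions are not pruned.
post = [r for r in left_join_on_ds(a, b) if r[2] == target]

# Pre-filtering BOTH sides on the join key is equivalent, because the output
# ds always comes from a, and b rows with any other ds can never match.
pre = left_join_on_ds([r for r in a if r[0] == target],
                      [r for r in b if r[0] == target])
assert post == pre
```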
[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers
[ https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787834#comment-16787834 ] Gabor Somogyi commented on SPARK-27044: --- Thanks [~herberts] rebuilt shaded jar was my main use-case as well. Proceeding... > Maven dependency resolution does not support classifiers > > > Key: SPARK-27044 > URL: https://issues.apache.org/jira/browse/SPARK-27044 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mathias Herberts >Priority: Major > > The spark-submit --packages option allows dependencies to be specified using > the maven syntax group:artifact:version, but it does not support specifying a > classifier as group:artifact:version:classifier > This makes it impossible to fetch 'uber' versions of packages for example as > they are usually specified using a classifier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
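The request above amounts to an optional fourth segment in the coordinate. A sketch (hypothetical helper, illustrative coordinates) of parsing group:artifact:version[:classifier]:

```python
def parse_coordinate(coord):
    """Parse group:artifact:version[:classifier] into a 4-tuple.

    The 3-part form is what --packages historically accepted; the optional
    classifier segment is the extension this ticket asks for.
    """
    parts = coord.split(":")
    if len(parts) == 3:
        group, artifact, version = parts
        return (group, artifact, version, None)
    if len(parts) == 4:
        return tuple(parts)
    raise ValueError("bad coordinate: %r" % coord)
```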
[jira] [Commented] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("")
[ https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787822#comment-16787822 ] Attila Zsolt Piros commented on SPARK-27090: I would have waited a little bit to find out others' opinions about this. But it's fine for me.
> Removing old LEGACY_DRIVER_IDENTIFIER ("")
> --
>
> Key: SPARK-27090
> URL: https://issues.apache.org/jira/browse/SPARK-27090
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Attila Zsolt Piros
> Priority: Major
>
> For legacy reasons LEGACY_DRIVER_IDENTIFIER was checked in a few places along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver or an executor is running.
> The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787815#comment-16787815 ] Ajith S commented on SPARK-27060: - Just checked on the master branch; it seems the property doesn't work anymore:
{{Spark master: local[*], Application Id: local-1552045445603}}
{{spark-sql> set spark.sql.parser.ansi.enabled;}}
{{spark.sql.parser.ansi.enabled true}}
{{Time taken: 2.178 seconds, Fetched 1 row(s)}}
{{spark-sql> CREATE TABLE CREATE(a int);}}
{{Time taken: 0.502 seconds}}
{{spark-sql> SHOW TABLES;}}
{{default create false}}
{{Time taken: 0.121 seconds, Fetched 1 row(s)}}
{{spark-sql> }}
> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Sachin Ramachandra Setty
> Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as hive and mySql.
> DDL commands are successful even though the tableName is the same as a keyword. Tested with columnNames as well and the issue exists.
> Whereas, Hive-Beeline throws a ParseException and does not accept keywords as tableName or columnName, and mySql accepts keywords only as columnName.
> Spark-Behaviour : > {code} > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > {code} > Hive-Behaviour : > {code} > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: 
Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
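The strict behaviour quoted above, rejecting a reserved word as a bare table name while accepting non-reserved words, can be reproduced with any parser that reserves CREATE; here Python's sqlite3 is used purely as a stand-in, not as a claim about Spark's parser:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A parser that reserves CREATE rejects it as a bare table name, as the
# Hive and mySql transcripts above show:
try:
    conn.execute("CREATE TABLE CREATE(ID integer)")
    strict = False
except sqlite3.OperationalError:
    strict = True

# Quoting the identifier is the standard escape hatch in ANSI-style SQL:
conn.execute('CREATE TABLE "CREATE"(ID integer)')
```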
[jira] [Assigned] (SPARK-24252) DataSourceV2: Add catalog support
[ https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24252: --- Assignee: Ryan Blue > DataSourceV2: Add catalog support > - > > Key: SPARK-24252 > URL: https://issues.apache.org/jira/browse/SPARK-24252 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > DataSourceV2 needs to support create and drop catalog operations in order to > support logical plans like CTAS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787797#comment-16787797 ] Ajith S edited comment on SPARK-27060 at 3/8/19 11:20 AM: -- [~sachin1729] can setting *spark.sql.parser.ansi.enabled=true* fix this? This configuration is to make spark follow ANSI SQL standards strictly; if it doesn't, it may be an issue.
was (Author: ajithshetty): [~sachin1729] can setting *spark.sql.parser.ansi.enabled=true* fix this? This configuration is to make spark follow ANSI SQL standards strictly
> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Sachin Ramachandra Setty
> Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as hive and mySql.
> DDL commands are successful even though the tableName is the same as a keyword. Tested with columnNames as well and the issue exists.
> Whereas, Hive-Beeline throws a ParseException and does not accept keywords as tableName or columnName, and mySql accepts keywords only as columnName.
> Spark-Behaviour : > {code} > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > {code} > Hive-Behaviour : > {code} > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: 
Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24252) DataSourceV2: Add catalog support
[ https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24252. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23915 [https://github.com/apache/spark/pull/23915] > DataSourceV2: Add catalog support > - > > Key: SPARK-24252 > URL: https://issues.apache.org/jira/browse/SPARK-24252 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > DataSourceV2 needs to support create and drop catalog operations in order to > support logical plans like CTAS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787803#comment-16787803 ] Ajith S commented on SPARK-27060: - Please refer SPARK-26215 > DDL Commands are accepting Keywords like create, drop as tableName > -- > > Key: SPARK-27060 > URL: https://issues.apache.org/jira/browse/SPARK-27060 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sachin Ramachandra Setty >Priority: Minor > > Seems to be a compatibility issue compared to other components such as hive > and mySql. > DDL commands are successful even though the tableName is same as keyword. > Tested with columnNames as well and issue exists. > Whereas, Hive-Beeline is throwing ParseException and not accepting keywords > as tableName or columnName and mySql is accepting keywords only as columnName. > Spark-Behaviour : > {code} > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > {code} > Hive-Behaviour : > {code} > Connected to: Apache Hive (version 3.1.0) > 
Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName
[ https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787797#comment-16787797 ] Ajith S commented on SPARK-27060: - [~sachin1729] can setting *spark.sql.parser.ansi.enabled=true* fix this? This configuration is to make spark follow ANSI SQL standards strictly.
> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Sachin Ramachandra Setty
> Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as hive and mySql.
> DDL commands are successful even though the tableName is the same as a keyword. Tested with columnNames as well and the issue exists.
> Whereas, Hive-Beeline throws a ParseException and does not accept keywords as tableName or columnName, and mySql accepts keywords only as columnName.
> Spark-Behaviour : > {code} > Connected to: Spark SQL (version 2.3.2.0101) > CLI_DBMS_APPID > Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.255 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.257 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.236 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create; > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.168 seconds) > 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.111 seconds) > 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float); > +-+--+ > | Result | > +-+--+ > +-+--+ > No rows selected (0.093 seconds) > {code} > Hive-Behaviour : > {code} > Connected to: Apache Hive (version 3.1.0) > Driver: Hive JDBC (version 3.1.0) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 3.1.0 by Apache Hive > 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:13 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float); > Error: Error while compiling statement: FAILED: ParseException line 1:18 > cannot recognize input near 'float' 'float' ')' in column name or constraint > (state=42000,code=4) > 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int); > Error: 
Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'create' '(' 'id' in table name > (state=42000,code=4) > 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int); > Error: Error while compiling statement: FAILED: ParseException line 1:11 > cannot recognize input near 'drop' '(' 'id' in table name > (state=42000,code=4) > mySql : > CREATE TABLE CREATE(ID integer); > Error: near "CREATE": syntax error > CREATE TABLE DROP(ID integer); > Error: near "DROP": syntax error > CREATE TABLE TAB1(FLOAT FLOAT); > Success > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27098) Flaky missing file parts when writing to Ceph without error
[ https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787792#comment-16787792 ] Steve Loughran commented on SPARK-27098: It was my suggestion to file this. * If this was AWS S3 this would "just" be AWS S3's eventul consistency surfacing on renames: the directory listing needed to mimic the rename missing the newly committed files -most likely when the write is immediately before the rename (v2 task commit, v1 job commit + the straggler tasks), with the solutions being the standard ones: use a table in dynamo for list consistency, or a zero-rename-committer which doesn't need consistent listings. (Or: Iceberg) * But this is Ceph, which is, AFAIK, consistent. # Who has played with Ceph as the destination store for queries? Through the S3A libraries? # What do people think can be enabled/added to the spark-level committers to detect this problem. The tasks know the files they've actually created and can report to the job committer -it could do a post-job-commit audit of the output and fail if something is missing. [~mwlon] is going to be the one trying to debug this. Martin: # you can get some more logging of what's up in the S3A code by setting the log for {{org.apache.hadoop.fs.s3a.S3AFileSystem}} to debug and looking for log entries beginning "Rename path". At least I think so, that 2.7.x codebase is 3+ years old, frozen for all but security fixes for 12 months, and never going to get another release (related to the AWS SDK, ironically). # the Hadoop 2.9.x releases do have S3Guard in, and while using a remote DDB table to add consistency to a local Ceph store is pretty inefficient, it'd be interesting to see whether enabling it would make this problem go away. In which case, you've just found a bug in Ceph # [Ryan's S3 committers|https://github.com/rdblue/s3committer] do work with hadoop 2.7.x. 
Try them > Flaky missing file parts when writing to Ceph without error > --- > > Key: SPARK-27098 > URL: https://issues.apache.org/jira/browse/SPARK-27098 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Martin Loncaric >Priority: Major > > https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233 > Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint. > occasionally a file part will be missing; i.e. part 3 here: > ``` > > aws s3 ls my-bucket/folder/ > 2019-02-28 13:07:21 0 _SUCCESS > 2019-02-28 13:06:58 79428651 > part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:06:59 79586172 > part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:00 79561910 > part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:01 79192617 > part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:07 79364413 > part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:08 79623254 > part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:10 79445030 > part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:10 79474923 > part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:11 79477310 > part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:12 79331453 > part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:13 79567600 > part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:13 79388012 > part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:14 79308387 > part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:15 79455483 > 
part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:17 79512342 > part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:18 79403307 > part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:18 79617769 > part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:19 79333534 > part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > 2019-02-28 13:07:20 79543324 > part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet > ``` > However, the write succeeds and leaves a _SUCCESS file. > This can be caught by additionally checking afterward whether the number of written file parts agrees with the number of partitions, but Spark should at least fail on its own and leave a meaningful stack trace in this case.
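The after-the-fact check the reporter describes - comparing the number of part files actually present with the number of partitions the job wrote - can be sketched without Spark. The sketch below uses a local temp directory to stand in for the S3 prefix (with a real bucket, the listing would come from the object store client instead); names like `audit_write` are illustrative, not an existing API:

```python
import glob
import os
import tempfile

def audit_write(output_dir, expected_parts):
    """Fail if a "successful" write is missing part files.

    Counts part-* files under output_dir and compares the count against
    the number of partitions the job was expected to write.
    """
    found = len(glob.glob(os.path.join(output_dir, "part-*")))
    if found != expected_parts:
        raise IOError("expected %d part files, found %d"
                      % (expected_parts, found))
    return found

# Simulate a commit that wrote _SUCCESS but silently dropped part 3 of 5.
out = tempfile.mkdtemp()
open(os.path.join(out, "_SUCCESS"), "w").close()
for i in (0, 1, 2, 4):
    open(os.path.join(out, "part-%05d.snappy.parquet" % i), "w").close()

try:
    audit_write(out, expected_parts=5)
    caught = False
except IOError:
    caught = True  # the hole is detected even though _SUCCESS exists
```

The same idea is what the comment above suggests pushing into the job committer itself: tasks report the files they created, and the job commit audits the listing against that report.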
[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers
[ https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787788#comment-16787788 ] Mathias Herberts commented on SPARK-27044: -- This is exactly what Jungtaek said: the ability to specify packages that have a non-default classifier, in order to fetch specific jars such as shaded/uber jars - which is the most common use case. > Maven dependency resolution does not support classifiers > > > Key: SPARK-27044 > URL: https://issues.apache.org/jira/browse/SPARK-27044 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mathias Herberts >Priority: Major > > The spark-submit --packages option allows dependencies to be specified using > the maven syntax group:artifact:version, but it does not support specifying a > classifier as group:artifact:version:classifier > This makes it impossible to fetch 'uber' versions of packages for example as > they are usually specified using a classifier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
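For reference, the grammar being requested is the four-segment Maven coordinate. A tiny parser sketch (a hypothetical helper, not Spark code; the `com.example:mylib` coordinates are made up) shows the difference between the three-part form `--packages` accepts today and the form with a classifier:

```python
def parse_coordinate(coord):
    """Parse a Maven coordinate of the form group:artifact:version[:classifier].

    The optional fourth segment is the classifier (e.g. 'uber', 'shaded',
    'tests'), which selects an alternative artifact of the same module.
    """
    parts = coord.split(":")
    if len(parts) not in (3, 4):
        raise ValueError("bad coordinate: %r" % coord)
    return dict(zip(("group", "artifact", "version", "classifier"), parts))

# The three-part form supported today:
plain = parse_coordinate("com.example:mylib:1.0.0")
# The four-part form this issue asks for; the classifier picks the uber jar:
uber = parse_coordinate("com.example:mylib:1.0.0:uber")
```

In a POM the same thing is expressed with a `<classifier>` element on the dependency, so the resolver-side support already exists in Maven/Ivy; the gap described here is only in how spark-submit parses the coordinate string.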
[jira] [Commented] (SPARK-24291) Data source table is not displaying records when files are uploaded to table location
[ https://issues.apache.org/jira/browse/SPARK-24291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787786#comment-16787786 ] Sushanta Sen commented on SPARK-24291: -- Yes, after refresh it fetches the data. But why does this happen for tables created with 'USING'? > Data source table is not displaying records when files are uploaded to table > location > - > > Key: SPARK-24291 > URL: https://issues.apache.org/jira/browse/SPARK-24291 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE11 > Spark Version: 2.3 >Reporter: Sushanta Sen >Priority: Major > > Precondition: > 1.Already one .orc file exists in the /tmp/orcdata/ location > # Launch Spark-sql > # spark-sql> CREATE TABLE os_orc (name string, version string, other string) > USING ORC OPTIONS (path '/tmp/orcdata/'); > # spark-sql> select * from os_orc; > Spark 2.3.0 Apache > Time taken: 2.538 seconds, Fetched 1 row(s) > # pc1:/opt/# *./hadoop dfs -ls /tmp/orcdata* > Found 1 items > -rw-r--r-- 3 spark hadoop 475 2018-05-09 18:21 > /tmp/orcdata/part-0-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc > pc1:/opt/# *./hadoop fs -copyFromLocal > /opt/OS/loaddata/orcdata/part-1-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc > /tmp/orcdata/data2.orc* > pc1:/opt/# *./hadoop dfs -ls /tmp/orcdata* > Found *2* items > -rw-r--r-- 3 spark hadoop 475 2018-05-15 14:59 /tmp/orcdata/data2.orc > -rw-r--r-- 3 spark hadoop 475 2018-05-09 18:21 > /tmp/orcdata/part-0-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc > pc1:/opt/# ** > 5.
Again execute the select command on the table os_orc > spark-sql> select * from os_orc; > Spark 2.3.0 Apache > Time taken: 1.528 seconds, Fetched 1 row(s) > Actual Result: On executing the select command it does not display all the records that exist in the data source table location. > Expected Result: All the records should be fetched and displayed for the data source table from the location. > NB: > 1. On exiting and relaunching the spark-sql session, the select command fetches the correct # of records. > 2. This issue is valid for all data source tables created with 'USING'. > I came across this use case in Spark 2.2.1 when I tried to reproduce a customer site observation.
[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers
[ https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787784#comment-16787784 ] Gabor Somogyi commented on SPARK-27044: --- I've already started but wanted to double check things from usage perspective. > Maven dependency resolution does not support classifiers > > > Key: SPARK-27044 > URL: https://issues.apache.org/jira/browse/SPARK-27044 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mathias Herberts >Priority: Major > > The spark-submit --packages option allows dependencies to be specified using > the maven syntax group:artifact:version, but it does not support specifying a > classifier as group:artifact:version:classifier > This makes it impossible to fetch 'uber' versions of packages for example as > they are usually specified using a classifier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27098) Flaky missing file parts when writing to Ceph without error
[ https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-27098: --- Description: https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233 Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint. occasionally a file part will be missing; i.e. part 3 here: ``` > aws s3 ls my-bucket/folder/ 2019-02-28 13:07:21 0 _SUCCESS 2019-02-28 13:06:58 79428651 part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:06:59 79586172 part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:00 79561910 part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:01 79192617 part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:07 79364413 part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:08 79623254 part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:10 79445030 part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:10 79474923 part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:11 79477310 part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:12 79331453 part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:13 79567600 part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:13 79388012 part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:14 79308387 part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:15 79455483 part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:17 79512342 part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:18 79403307 
part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:18 79617769 part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:19 79333534 part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:20 79543324 part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet ``` However, the write succeeds and leaves a _SUCCESS file. This can be caught by additionally checking afterward whether the number of written file parts agrees with the number of partitions, but Spark should at least fail on its own and leave a meaningful stack trace in this case. was: https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233 Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, occasionally a file part will be missing; i.e. part 3 here: ``` > aws s3 ls my-bucket/folder/ 2019-02-28 13:07:21 0 _SUCCESS 2019-02-28 13:06:58 79428651 part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:06:59 79586172 part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:00 79561910 part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:01 79192617 part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:07 79364413 part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:08 79623254 part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:10 79445030 part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:10 79474923 part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:11 79477310 part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:12 79331453 part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:13 79567600 part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 
2019-02-28 13:07:13 79388012 part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:14 79308387 part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:15 79455483 part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:17 79512342 part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:18 79403307 part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:18 79617769 part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:19 79333534 part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet 2019-02-28 13:07:20 79543324
[jira] [Comment Edited] (SPARK-27044) Maven dependency resolution does not support classifiers
[ https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787762#comment-16787762 ] Jungtaek Lim edited comment on SPARK-27044 at 3/8/19 10:39 AM: --- It's like "hive-exec" with default classifier vs "core" classifier. Former includes all transitive dependencies (without relocating) whereas latter just have itself and let Maven deal with transitive dependencies. I've some experience regarding this: if no one minds I would like to work on this. [~gsomogyi] Are you OK with this? It's perfectly OK if you already took a step on this. was (Author: kabhwan): It's like "hive-exec" with default classifier vs "core" classifier. Former includes all transitive dependencies (without relocating) whereas latter just have itself and let Maven deal with transitive dependencies. I've some experience regarding this: if no one minds I would like to work on this. [~gsomogyi] Are you OK with this? Or did you already take a step on this? > Maven dependency resolution does not support classifiers > > > Key: SPARK-27044 > URL: https://issues.apache.org/jira/browse/SPARK-27044 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mathias Herberts >Priority: Major > > The spark-submit --packages option allows dependencies to be specified using > the maven syntax group:artifact:version, but it does not support specifying a > classifier as group:artifact:version:classifier > This makes it impossible to fetch 'uber' versions of packages for example as > they are usually specified using a classifier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`
Ivan Vergiliev created SPARK-27105: -- Summary: Prevent exponential complexity in ORC `createFilter` Key: SPARK-27105 URL: https://issues.apache.org/jira/browse/SPARK-27105 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Ivan Vergiliev `OrcFilters.createFilters` currently has complexity that's exponential in the height of the filter tree. There are multiple places in Spark that try to prevent the generation of skewed trees so as to not trigger this behaviour, for example: - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` combines a number of binary logical expressions into a balanced tree. - https://github.com/apache/spark/pull/22313 introduced a change to `OrcFilters` to create a balanced tree instead of a skewed tree. However, the underlying exponential behaviour can still be triggered by code paths that don't go through any of the tree balancing methods. For example, if one generates a tree of `Column`s directly in user code, there's nothing in Spark that automatically balances that tree and, hence, skewed trees hit the exponential behaviour. We have hit this in production with jobs mysteriously taking hours on the Spark driver with no worker activity, with as few as ~30 OR filters. I have a fix locally that makes the underlying logic have linear complexity instead of exponential complexity. With this fix, the code can handle thousands of filters in milliseconds. I'll send a PR with the fix soon.
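The blow-up described in this issue is easy to reproduce in miniature: if a converter re-converts each child once to check convertibility and again to build the result, the work doubles at every level of a skewed tree. The sketch below is a self-contained toy model of that pattern and of the memoized fix (it is not the actual `OrcFilters` code; node and function names are invented):

```python
class Node:
    """A binary filter node; left is None for leaves."""
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def convert_naive(node, counter):
    """Convert a tree, re-converting each child to check it before building.

    Validate-then-build does every child conversion twice, so the total
    work grows exponentially with the depth of a skewed tree.
    """
    counter[0] += 1
    if node.left is None:
        return "leaf"
    if convert_naive(node.left, counter) and convert_naive(node.right, counter):
        return ("and(" + convert_naive(node.left, counter) + ", "
                + convert_naive(node.right, counter) + ")")
    return None

def convert_memo(node, counter, cache=None):
    """Same conversion, but each distinct subtree is converted exactly once."""
    if cache is None:
        cache = {}
    if id(node) in cache:
        return cache[id(node)]
    counter[0] += 1
    if node.left is None:
        result = "leaf"
    elif (convert_memo(node.left, counter, cache)
          and convert_memo(node.right, counter, cache)):
        result = ("and(" + convert_memo(node.left, counter, cache) + ", "
                  + convert_memo(node.right, counter, cache) + ")")
    else:
        result = None
    cache[id(node)] = result
    return result

# A left-skewed chain of 16 AND nodes, like a long hand-built Column filter.
tree = Node()
for _ in range(16):
    tree = Node(left=tree, right=Node())

naive_calls, memo_calls = [0], [0]
naive_result = convert_naive(tree, naive_calls)
memo_result = convert_memo(tree, memo_calls)
# naive_calls[0] is 2**18 - 3 (exponential); memo_calls[0] is 33 (linear).
```

The call counts follow the recurrence T(k) = 2*T(k-1) + 3 for the naive version, which is why even ~30 filters can stall a driver for hours, while the memoized version touches each of the 33 nodes once.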
[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers
[ https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787762#comment-16787762 ] Jungtaek Lim commented on SPARK-27044: -- It's like "hive-exec" with default classifier vs "core" classifier. Former includes all transitive dependencies (without relocating) whereas latter just have itself and let Maven deal with transitive dependencies. I've some experience regarding this: if no one minds I would like to work on this. [~gsomogyi] Are you OK with this? Or did you already take a step on this? > Maven dependency resolution does not support classifiers > > > Key: SPARK-27044 > URL: https://issues.apache.org/jira/browse/SPARK-27044 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mathias Herberts >Priority: Major > > The spark-submit --packages option allows dependencies to be specified using > the maven syntax group:artifact:version, but it does not support specifying a > classifier as group:artifact:version:classifier > This makes it impossible to fetch 'uber' versions of packages for example as > they are usually specified using a classifier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers
[ https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787754#comment-16787754 ] Gabor Somogyi commented on SPARK-27044: --- [~herberts] Could you give an example of {quote}fetch 'uber' versions of packages{quote}? > Maven dependency resolution does not support classifiers > > > Key: SPARK-27044 > URL: https://issues.apache.org/jira/browse/SPARK-27044 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Mathias Herberts >Priority: Major > > The spark-submit --packages option allows dependencies to be specified using > the maven syntax group:artifact:version, but it does not support specifying a > classifier as group:artifact:version:classifier > This makes it impossible to fetch 'uber' versions of packages for example as > they are usually specified using a classifier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787749#comment-16787749 ] Hyukjin Kwon commented on SPARK-27100: -- Hm, can you show reproducible steps including codes? I think it should be checked if there's the same issue in the current master branch or not. > dag-scheduler-event-loop" java.lang.StackOverflowError > -- > > Key: SPARK-27100 > URL: https://issues.apache.org/jira/browse/SPARK-27100 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.3, 2.3.3 >Reporter: KaiXu >Priority: Major > Attachments: stderr > > > ALS in Spark MLlib causes StackOverflow: > /opt/sparkml/spark213/bin/spark-submit --properties-file > /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class > com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client > --num-executors 3 --executor-memory 322g > /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 > --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 > --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input > > Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
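The trace above is Java serialization recursing through a long RDD lineage: each stage serializes a structure that references its parent, and 100 ALS iterations build a very deep chain, so the writer's own recursion overflows the stack. The same failure mode can be demonstrated with any recursive serializer; here it is with Python's pickle on a deeply nested structure (an analogy only, not Spark code). In Spark, the usual mitigation is to cut the lineage periodically, e.g. via ALS's checkpointInterval parameter or an explicit rdd.checkpoint():

```python
import pickle
import sys

def try_pickle(depth):
    """Serialize a chain of nested lists `depth` levels deep."""
    chain = []
    for _ in range(depth):
        chain = [chain]  # each level references the previous, like RDD lineage
    try:
        pickle.dumps(chain)
        return True
    except RecursionError:
        return False

shallow_ok = try_pickle(50)                        # a short chain serializes fine
deep_ok = try_pickle(sys.getrecursionlimit() * 2)  # a deep one overflows
```

Checkpointing plays the role of flattening the chain: after a checkpoint, serialization starts from the materialized data rather than walking the whole history.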
[jira] [Resolved] (SPARK-27087) Inability to access to column alias in pyspark
[ https://issues.apache.org/jira/browse/SPARK-27087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27087. -- Resolution: Invalid > Inability to access to column alias in pyspark > -- > > Key: SPARK-27087 > URL: https://issues.apache.org/jira/browse/SPARK-27087 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Vincent >Priority: Minor > > In pyspark I have the following: > {code:java} > import pyspark.sql.functions as F > cc = F.lit(1).alias("A") > print(cc) > print(cc._jc.toString()) > {code} > I get : > {noformat} > Column > 1 AS `A` > {noformat} > Is there any way for me to just print "A" from cc? It seems I'm unable to > extract the alias programmatically from the column object. > Also I think that in spark-sql in scala, if I print "cc" it would just print > "A" instead, so this seems like a bug or a missing feature to me
[jira] [Commented] (SPARK-27087) Inability to access to column alias in pyspark
[ https://issues.apache.org/jira/browse/SPARK-27087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787732#comment-16787732 ] Hyukjin Kwon commented on SPARK-27087: -- Scala side shows the same output as well: {code} scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> lit(1).alias("A").toString res2: String = 1 AS `A` {code} It shows the expression as is, so printing the expression string as is is the correct behaviour - not a bug or a missing feature. > Inability to access to column alias in pyspark > -- > > Key: SPARK-27087 > URL: https://issues.apache.org/jira/browse/SPARK-27087 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Vincent >Priority: Minor > > In pyspark I have the following: > {code:java} > import pyspark.sql.functions as F > cc = F.lit(1).alias("A") > print(cc) > print(cc._jc.toString()) > {code} > I get : > {noformat} > Column > 1 AS `A` > {noformat} > Is there any way for me to just print "A" from cc? It seems I'm unable to > extract the alias programmatically from the column object. > Also I think that in spark-sql in scala, if I print "cc" it would just print > "A" instead, so this seems like a bug or a missing feature to me
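Since both APIs print the full expression, the alias has to be recovered from the string form if it is needed. A fragile workaround sketch (it parses only the printed representation, e.g. from cc._jc.toString(), and is not a supported API):

```python
import re

def alias_of(expr_str):
    """Extract the alias from a printed expression like "1 AS `A`".

    Returns None when no AS clause is present. This will break on nested
    or more exotic expressions, so treat it as a stopgap only.
    """
    match = re.search(r"AS `([^`]+)`$", expr_str)
    return match.group(1) if match else None

alias_of("1 AS `A`")   # -> 'A'
alias_of("someCol")    # -> None (no alias present)
```

A more robust route would be to keep the alias in a Python variable at the point where .alias(...) is called, rather than trying to recover it from the Column afterwards.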
[jira] [Updated] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27027: - Component/s: Spark Shell > from_avro function does not deserialize the Avro record of a struct column > type correctly > - > > Key: SPARK-27027 > URL: https://issues.apache.org/jira/browse/SPARK-27027 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Hien Luu >Priority: Minor > > {{from_avro}} function produces wrong output of a struct field. See the > output at the bottom of the description > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.avro._ > import org.apache.spark.sql.functions._ > spark.version > val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", > 50)).toDF("id", "name", "age") > val dfStruct = df.withColumn("value", struct("name","age")) > dfStruct.show > dfStruct.printSchema > val dfKV = dfStruct.select(to_avro('id).as("key"), > to_avro('value).as("value")) > val expectedSchema = StructType(Seq(StructField("name", StringType, > true),StructField("age", IntegerType, false))) > val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString > val avroTypeStr = s""" > |{ > | "type": "int", > | "name": "key" > |} > """.stripMargin > dfKV.select(from_avro('key, avroTypeStr)).show > dfKV.select(from_avro('value, avroTypeStruct)).show > // output for the last statement and that is not correct > +-+ > |from_avro(value, struct)| > +-+ > | [Josh Duke, 50]| > | [Josh Duke, 50]| > | [Josh Duke, 50]| > +-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787722#comment-16787722 ] Hyukjin Kwon commented on SPARK-27027: -- [~hluu], just to confirm, did you face this issue in Spark shell? > from_avro function does not deserialize the Avro record of a struct column > type correctly > - > > Key: SPARK-27027 > URL: https://issues.apache.org/jira/browse/SPARK-27027 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Hien Luu >Priority: Minor > > {{from_avro}} function produces wrong output of a struct field. See the > output at the bottom of the description > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.avro._ > import org.apache.spark.sql.functions._ > spark.version > val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", > 50)).toDF("id", "name", "age") > val dfStruct = df.withColumn("value", struct("name","age")) > dfStruct.show > dfStruct.printSchema > val dfKV = dfStruct.select(to_avro('id).as("key"), > to_avro('value).as("value")) > val expectedSchema = StructType(Seq(StructField("name", StringType, > true),StructField("age", IntegerType, false))) > val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString > val avroTypeStr = s""" > |{ > | "type": "int", > | "name": "key" > |} > """.stripMargin > dfKV.select(from_avro('key, avroTypeStr)).show > dfKV.select(from_avro('value, avroTypeStruct)).show > // output for the last statement and that is not correct > +-+ > |from_avro(value, struct)| > +-+ > | [Josh Duke, 50]| > | [Josh Duke, 50]| > | [Josh Duke, 50]| > +-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27027: - Affects Version/s: 3.0.0 > from_avro function does not deserialize the Avro record of a struct column > type correctly > - > > Key: SPARK-27027 > URL: https://issues.apache.org/jira/browse/SPARK-27027 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Hien Luu >Priority: Minor > > {{from_avro}} function produces wrong output of a struct field. See the > output at the bottom of the description > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.avro._ > import org.apache.spark.sql.functions._ > spark.version > val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", > 50)).toDF("id", "name", "age") > val dfStruct = df.withColumn("value", struct("name","age")) > dfStruct.show > dfStruct.printSchema > val dfKV = dfStruct.select(to_avro('id).as("key"), > to_avro('value).as("value")) > val expectedSchema = StructType(Seq(StructField("name", StringType, > true),StructField("age", IntegerType, false))) > val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString > val avroTypeStr = s""" > |{ > | "type": "int", > | "name": "key" > |} > """.stripMargin > dfKV.select(from_avro('key, avroTypeStr)).show > dfKV.select(from_avro('value, avroTypeStruct)).show > // output for the last statement and that is not correct > +-+ > |from_avro(value, struct)| > +-+ > | [Josh Duke, 50]| > | [Josh Duke, 50]| > | [Josh Duke, 50]| > +-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-27027: -- > from_avro function does not deserialize the Avro record of a struct column > type correctly > - > > Key: SPARK-27027 > URL: https://issues.apache.org/jira/browse/SPARK-27027 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hien Luu >Priority: Minor > > {{from_avro}} function produces wrong output of a struct field. See the > output at the bottom of the description > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.avro._ > import org.apache.spark.sql.functions._ > spark.version > val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", > 50)).toDF("id", "name", "age") > val dfStruct = df.withColumn("value", struct("name","age")) > dfStruct.show > dfStruct.printSchema > val dfKV = dfStruct.select(to_avro('id).as("key"), > to_avro('value).as("value")) > val expectedSchema = StructType(Seq(StructField("name", StringType, > true),StructField("age", IntegerType, false))) > val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString > val avroTypeStr = s""" > |{ > | "type": "int", > | "name": "key" > |} > """.stripMargin > dfKV.select(from_avro('key, avroTypeStr)).show > dfKV.select(from_avro('value, avroTypeStruct)).show > // output for the last statement and that is not correct > +-+ > |from_avro(value, struct)| > +-+ > | [Josh Duke, 50]| > | [Josh Duke, 50]| > | [Josh Duke, 50]| > +-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787720#comment-16787720 ] Gabor Somogyi commented on SPARK-27027: --- After compiling, I ran {quote}./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.0.0-SNAPSHOT{quote} but as far as I remember some changes were needed in the commands.
[jira] [Commented] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables
[ https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787716#comment-16787716 ] Saurabh Santhosh commented on SPARK-23012: -- [~yumwang] [~reks95] Hi, tested this in Spark 2.4.0 and it's working fine :)

> Support for predicate pushdown and partition pruning when left joining large Hive tables
>
> Key: SPARK-23012
> URL: https://issues.apache.org/jira/browse/SPARK-23012
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer
> Affects Versions: 2.2.0
> Reporter: Rick Kramer
> Priority: Major
>
> We have a hive view which left outer joins several large, partitioned orc hive tables together on date. When the view is used in a hive query, hive pushes date predicates down into the joins and prunes the partitions for all tables. When I use this view from pyspark, the predicate is only used to prune the left-most table, and all partitions of the additional tables are selected.
> For example, consider two partitioned hive tables a & b joined in a view:
> create table a (
>   a_val string
> )
> partitioned by (ds string)
> stored as orc;
> create table b (
>   b_val string
> )
> partitioned by (ds string)
> stored as orc;
> create view example_view as
> select a_val, b_val, ds
> from a
> left outer join b on b.ds = a.ds
> Then in pyspark you might try to query the view, filtering on ds:
> spark.table('example_view').filter(F.col('ds') == '2018-01-01')
> If tables a and b are large, this results in a plan that filters a on ds = '2018-01-01' but scans all partitions of table b.
> If the join in the view is changed to an inner join, the predicate gets pushed down to a & b and the partitions are pruned as you'd expect.
> In practice, the view is fairly complex and contains a lot of business logic we'd prefer not to replicate in pyspark if we can avoid it.
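Until the optimizer pushes the predicate through the outer join itself, a common workaround is to apply the partition filter to each side before joining, so both table scans are pruned. A sketch, using the table and column names from the example above (the view's column selection is assumed):

```scala
// Workaround sketch: filter each partitioned table on ds *before* the
// left outer join, so partition pruning applies to both sides.
import org.apache.spark.sql.functions.col

val ds = "2018-01-01"
val a = spark.table("a").filter(col("ds") === ds)
val b = spark.table("b").filter(col("ds") === ds)

// Same result as filtering the view, but with the predicate already
// pushed into both scans instead of only the left-most one.
val pruned = a.join(b, Seq("ds"), "left_outer").select("a_val", "b_val", "ds")
```

The trade-off is that the filter now lives in the pyspark code rather than in the view, which is exactly the duplication of business logic the reporter wanted to avoid.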
[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787708#comment-16787708 ] Hyukjin Kwon commented on SPARK-27027: -- How did you run spark-shell?
[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787701#comment-16787701 ] Gabor Somogyi commented on SPARK-27027: --- With spark-shell; in a unit test the functionality works.
[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787695#comment-16787695 ] Hyukjin Kwon commented on SPARK-27027: -- So, how did you reproduce it on the current master? Please set the affected version to 3.0.0 too and fix the JIRA description.
[jira] [Updated] (SPARK-27104) Spark Fair scheduler across applications in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-27104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hua Zhang updated SPARK-27104: -- Description:

Spark in standalone mode currently only supports a FIFO (first-in-first-out) scheduler across applications. It would be great if a fair scheduler were supported. +A fair scheduler across applications, not within an application.+

Use case (for example with the integration of zeppelin): At a certain moment, user A submits a heavy application and consumes all the resources of the spark cluster. At a later moment, user B submits a second application. No matter how many worker nodes you add now, all the resources go to user A due to the FIFO. User B will never get any resources until user A releases its allocated resources.

A fair scheduler should distribute extra resources fairly across all running applications that demand resources.

was: Spark in standalone mode currently only supports FIFO (first-in-first-out) scheduler across applications. It will be great that a fair scheduler is supported. +A fair scheduler across applications, not in a application.+ Use case (for example with the integration of zeppelin) At certain moment, user A submits an heavy application and consumes all the resources of the spark cluster. At a later moment, user B submits a second application. No matter how many work nodes you added now, all the resources go to user A due to the FIFO. User B will never get any resource until user A release its allocated resources. A fair scheduler should distribute extra resources in a fair way on all running applications, which demands resources.
> Spark Fair scheduler across applications in standalone mode
>
> Key: SPARK-27104
> URL: https://issues.apache.org/jira/browse/SPARK-27104
> Project: Spark
> Issue Type: Wish
> Components: Scheduler
> Affects Versions: 2.2.3, 2.3.3, 2.4.0
> Reporter: Hua Zhang
> Priority: Minor
>
> Spark in standalone mode currently only supports a FIFO (first-in-first-out) scheduler across applications. It would be great if a fair scheduler were supported. +A fair scheduler across applications, not within an application.+
>
> Use case (for example with the integration of zeppelin): At a certain moment, user A submits a heavy application and consumes all the resources of the spark cluster. At a later moment, user B submits a second application. No matter how many worker nodes you add now, all the resources go to user A due to the FIFO. User B will never get any resources until user A releases its allocated resources.
>
> A fair scheduler should distribute extra resources fairly across all running applications that demand resources.
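For contrast, fair scheduling *within* a single application already exists in Spark via scheduler pools; the wish here is for the analogous behavior across applications in standalone mode. A sketch of the existing within-application configuration (the pool name is illustrative):

```xml
<!-- conf/fairscheduler.xml: fair scheduling between jobs *inside* one
     application. This does not help the use case above, where two separate
     applications compete for cluster resources under standalone FIFO. -->
<allocations>
  <pool name="adhoc">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

This is enabled by setting spark.scheduler.mode=FAIR and pointing spark.scheduler.allocation.file at the XML above; jobs then opt into a pool via spark.scheduler.pool on their thread's local properties.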
[jira] [Updated] (SPARK-27104) Spark Fair scheduler across applications in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-27104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hua Zhang updated SPARK-27104: -- Description: Spark in standalone mode currently only supports FIFO (first-in-first-out) scheduler across applications. It will be great that a fair scheduler is supported. +A fair scheduler across applications, not in a application.+ Use case (for example with the integration of zeppelin) At certain moment, user A submits an heavy application and consumes all the resources of the spark cluster. At a later moment, user B submits a second application. No matter how many work nodes you added now, all the resources go to user A due to the FIFO. User B will never get any resource until user A release its allocated resources. A fair scheduler should distribute extra resources in a fair way on all running applications, which demands resources. was: Spark in standalone mode currently only supports FIFO (first-in-first-out) scheduler across applications. It will be great that a fair scheduler is supported. +A fair scheduler across applications, not in a application.+ Use case (for example with the integration of zeppelin) At certain moment, user A submits an heavy application en consumes all the resources of the spark cluster. At a later moment, user B submits a second application. No matter how many work nodes you added now, all the resources go to user A due to the FIFO. User B will never get any resource until user A release its allocated resources. A fair scheduler should distribute extra resources in a fair way on all running applications, which demands resources. 
[jira] [Created] (SPARK-27104) Spark Fair scheduler across applications in standalone mode
Hua Zhang created SPARK-27104: - Summary: Spark Fair scheduler across applications in standalone mode Key: SPARK-27104 URL: https://issues.apache.org/jira/browse/SPARK-27104 Project: Spark Issue Type: Wish Components: Scheduler Affects Versions: 2.4.0, 2.3.3, 2.2.3 Reporter: Hua Zhang

Spark in standalone mode currently only supports a FIFO (first-in-first-out) scheduler across applications. It would be great if a fair scheduler were supported. +A fair scheduler across applications, not within an application.+

Use case (for example with the integration of zeppelin): At a certain moment, user A submits a heavy application and consumes all the resources of the spark cluster. At a later moment, user B submits a second application. No matter how many worker nodes you add now, all the resources go to user A due to the FIFO. User B will never get any resources until user A releases its allocated resources.

A fair scheduler should distribute extra resources fairly across all running applications that demand resources.
[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787662#comment-16787662 ] Gabor Somogyi commented on SPARK-27027: --- Please see the comments. The issue is fully reproducible.
[jira] [Comment Edited] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787662#comment-16787662 ] Gabor Somogyi edited comment on SPARK-27027 at 3/8/19 8:34 AM: --- Please see the comments. The issue is fully reproducible even on master. was (Author: gsomogyi): Please see the comments. The issue fully reproducible.