[jira] [Assigned] (SPARK-27036) Even if the broadcast thread is timed out, the broadcast job is not aborted.

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27036:


Assignee: (was: Apache Spark)

> Even if the broadcast thread is timed out, the broadcast job is not aborted.
> -
>
> Key: SPARK-27036
> URL: https://issues.apache.org/jira/browse/SPARK-27036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Babulal
>Priority: Minor
> Attachments: image-2019-03-04-00-38-52-401.png, 
> image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png
>
>
> While a broadcast table job is executing, if the broadcast timeout 
> (spark.sql.broadcastTimeout) expires, the broadcast job still continues to 
> completion, whereas it should be aborted on the broadcast timeout.
> An exception is thrown in the console, but the Spark job still continues.
>  
> !image-2019-03-04-00-39-38-779.png!
> !image-2019-03-04-00-39-12-210.png!
>  
>  wait for some time
> !image-2019-03-04-00-38-52-401.png!
> !image-2019-03-04-00-34-47-884.png!
>  
> How to reproduce the issue
> Option 1, using SQL: 
>  Create table t1 (big table, 1M records):
>  val rdd1=spark.sparkContext.parallelize(1 to 1000000,100).map(x=> 
> ("name_"+x,x%3,x))
>  val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
>  df.write.csv("D:/data/par1/t4");
>  spark.sql("create table csv_1 using csv options('path'='D:/data/par1/t4')");
> Create table t2 (small table, 100K records):
>  val rdd2=spark.sparkContext.parallelize(1 to 100000,100).map(x=> 
> ("name_"+x,x%3,x))
>  val df2=rdd2.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
>  df2.write.csv("D:/data/par1/t5");
>  spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')");
> Set the broadcast threshold and a short broadcast timeout:
>  spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
>  spark.sql("set spark.sql.broadcastTimeout=2").show(false)
>  Run the query below:
>  spark.sql("create table s using parquet as select t1.* from csv_2 as 
> t1,csv_1 as t2 where t1._c3=t2._c3")
> Option 2: use an external data source that adds a delay in its #buildScan, and 
> use that data source in the query (a sketch follows below).
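For Option 2, a minimal sketch (not part of the original report; the package, class names, and delay length are illustrative) of an external data source whose buildScan is delayed so that building the broadcast side outlives spark.sql.broadcastTimeout:

{code:java}
// Illustrative package/class names; compile this into a jar on the classpath.
package org.example.slowsource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = new SlowRelation(sqlContext)
}

class SlowRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType =
    StructType(Seq(StructField("name", StringType), StructField("sal", IntegerType)))

  // Delay the scan so that broadcasting this relation exceeds the 2s timeout.
  override def buildScan(): RDD[Row] = {
    Thread.sleep(10000)
    sqlContext.sparkContext.parallelize(1 to 100).map(i => Row("name_" + i, i))
  }
}
{code}

It can then be registered as a temporary view and joined as the broadcast side against the csv_1 table from Option 1, for example:

{code:java}
spark.read.format("org.example.slowsource").load().createOrReplaceTempView("slow_small")
spark.sql("set spark.sql.broadcastTimeout=2")
spark.sql("select /*+ BROADCAST(s) */ t.* from csv_1 t join slow_small s on t._c2 = s.sal").show()
{code}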



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27036) Even if the broadcast thread is timed out, the broadcast job is not aborted.

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27036:


Assignee: Apache Spark

> Even if the broadcast thread is timed out, the broadcast job is not aborted.
> -
>
> Key: SPARK-27036
> URL: https://issues.apache.org/jira/browse/SPARK-27036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Babulal
>Assignee: Apache Spark
>Priority: Minor
> Attachments: image-2019-03-04-00-38-52-401.png, 
> image-2019-03-04-00-39-12-210.png, image-2019-03-04-00-39-38-779.png
>
>
> While a broadcast table job is executing, if the broadcast timeout 
> (spark.sql.broadcastTimeout) expires, the broadcast job still continues to 
> completion, whereas it should be aborted on the broadcast timeout.
> An exception is thrown in the console, but the Spark job still continues.
>  
> !image-2019-03-04-00-39-38-779.png!
> !image-2019-03-04-00-39-12-210.png!
>  
>  wait for some time
> !image-2019-03-04-00-38-52-401.png!
> !image-2019-03-04-00-34-47-884.png!
>  
> How to reproduce the issue
> Option 1, using SQL: 
>  Create table t1 (big table, 1M records):
>  val rdd1=spark.sparkContext.parallelize(1 to 1000000,100).map(x=> 
> ("name_"+x,x%3,x))
>  val df=rdd1.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
>  df.write.csv("D:/data/par1/t4");
>  spark.sql("create table csv_1 using csv options('path'='D:/data/par1/t4')");
> Create table t2 (small table, 100K records):
>  val rdd2=spark.sparkContext.parallelize(1 to 100000,100).map(x=> 
> ("name_"+x,x%3,x))
>  val df2=rdd2.toDF.selectExpr("_1 as name","_2 as age","_3 as sal","_1 as 
> c1","_1 as c2","_1 as c3","_1 as c4","_1 as c5","_1 as c6","_1 as c7","_1 as 
> c8","_1 as c9","_1 as c10","_1 as c11","_1 as c12","_1 as c13","_1 as 
> c14","_1 as c15","_1 as c16","_1 as c17","_1 as c18","_1 as c19","_1 as 
> c20","_1 as c21","_1 as c22","_1 as c23","_1 as c24","_1 as c25","_1 as 
> c26","_1 as c27","_1 as c28","_1 as c29","_1 as c30")
>  df2.write.csv("D:/data/par1/t5");
>  spark.sql("create table csv_2 using csv options('path'='D:/data/par1/t5')");
> Set the broadcast threshold and a short broadcast timeout:
>  spark.sql("set spark.sql.autoBroadcastJoinThreshold=73400320").show(false)
>  spark.sql("set spark.sql.broadcastTimeout=2").show(false)
>  Run the query below:
>  spark.sql("create table s using parquet as select t1.* from csv_2 as 
> t1,csv_1 as t2 where t1._c3=t2._c3")
> Option 2: use an external data source that adds a delay in its #buildScan, and 
> use that data source in the query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27113) remove CHECK_FILES_EXIST_KEY option in file source

2019-03-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27113:
---

 Summary: remove CHECK_FILES_EXIST_KEY option in file source
 Key: SPARK-27113
 URL: https://issues.apache.org/jira/browse/SPARK-27113
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27056) Remove `start-shuffle-service.sh`

2019-03-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27056.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23975
[https://github.com/apache/spark/pull/23975]

> Remove  `start-shuffle-service.sh`
> --
>
> Key: SPARK-27056
> URL: https://issues.apache.org/jira/browse/SPARK-27056
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 3.0.0
>
>
> _start-shuffle-service.sh_ was only used by Mesos before 
> _start-mesos-shuffle-service.sh_ was introduced.
> _start-mesos-shuffle-service.sh_ solves some problems and is 
> better than _start-shuffle-service.sh_.
> So we should now delete _start-shuffle-service.sh_ so that users do not use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27056) Remove `start-shuffle-service.sh`

2019-03-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27056:
-

Assignee: liuxian

> Remove  `start-shuffle-service.sh`
> --
>
> Key: SPARK-27056
> URL: https://issues.apache.org/jira/browse/SPARK-27056
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
>
> _start-shuffle-service.sh_ was only used by Mesos before 
> _start-mesos-shuffle-service.sh_ was introduced.
> _start-mesos-shuffle-service.sh_ solves some problems and is 
> better than _start-shuffle-service.sh_.
> So we should now delete _start-shuffle-service.sh_ so that users do not use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27101) Drop the created database after the test in test_session

2019-03-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27101:
-
Summary: Drop the created database after the test in test_session  (was: 
clean up testcase in SparkSessionTests3)

> Drop the created database after the test in test_session
> 
>
> Key: SPARK-27101
> URL: https://issues.apache.org/jira/browse/SPARK-27101
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Sandeep Katta
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27101) Drop the created database after the test in test_session

2019-03-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27101.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24021
[https://github.com/apache/spark/pull/24021]

> Drop the created database after the test in test_session
> 
>
> Key: SPARK-27101
> URL: https://issues.apache.org/jira/browse/SPARK-27101
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Trivial
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27101) Drop the created database after the test in test_session

2019-03-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27101:


Assignee: Sandeep Katta

> Drop the created database after the test in test_session
> 
>
> Key: SPARK-27101
> URL: https://issues.apache.org/jira/browse/SPARK-27101
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Sandeep Katta
>Assignee: Sandeep Katta
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27112:


Assignee: Apache Spark

> Spark Scheduler encounters two independent Deadlocks when trying to kill 
> executors either due to dynamic allocation or blacklisting 
> 
>
> Key: SPARK-27112
> URL: https://issues.apache.org/jira/browse/SPARK-27112
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Parth Gandhi
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 
> 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, 
> Screen Shot 2019-02-26 at 4.11.26 PM.png
>
>
> Recently, a few Spark users in the organization reported that their jobs 
> were getting stuck. On further analysis, it was found that there are two 
> independent deadlocks, each of which occurs under different circumstances. 
> Screenshots of the two deadlocks are attached here. 
> We were able to reproduce the deadlocks with the following piece of code:
>  
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark._
> import org.apache.spark.TaskContext
> // Simple example of Word Count in Scala
> object ScalaWordCount {
> def main(args: Array[String]) {
> if (args.length < 2) {
> System.err.println("Usage: ScalaWordCount  ")
> System.exit(1)
> }
> val conf = new SparkConf().setAppName("Scala Word Count")
> val sc = new SparkContext(conf)
> // get the input file uri
> val inputFilesUri = args(0)
> // get the output file uri
> val outputFilesUri = args(1)
> while (true) {
> val textFile = sc.textFile(inputFilesUri)
> val counts = textFile.flatMap(line => line.split(" "))
> .map(word => {if (TaskContext.get.partitionId == 5 && 
> TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
> blacklisting") else (word, 1)})
> .reduceByKey(_ + _)
> counts.saveAsTextFile(outputFilesUri)
> val conf: Configuration = new Configuration()
> val path: Path = new Path(outputFilesUri)
> val hdfs: FileSystem = FileSystem.get(conf)
> hdfs.delete(path, true)
> }
> sc.stop()
> }
> }
> {code}
>  
> Additionally, to ensure that the deadlock surfaces soon enough, I also 
> added a small delay in the Spark code here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]
>  
> {code:java}
> executorIdToFailureList.remove(exec)
> updateNextExpiryTime()
> Thread.sleep(2000)
> killBlacklistedExecutor(exec)
> {code}
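The executor-kill paths involved here depend on blacklisting (with executor killing) and dynamic allocation being enabled; a sketch of the configuration assumed for the reproduction (these settings are not shown in the report):

{code:java}
import org.apache.spark.SparkConf

// Assumed settings: blacklist failing executors and kill them, and let
// dynamic allocation request and release executors.
val conf = new SparkConf()
  .setAppName("Scala Word Count")
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.killBlacklistedExecutors", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
{code}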



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily has 0 partitions

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27111:


Assignee: Apache Spark  (was: Shixiong Zhu)

> A continuous query may fail with InterruptedException when the Kafka 
> consumer temporarily has 0 partitions
> 
>
> Key: SPARK-27111
> URL: https://issues.apache.org/jira/browse/SPARK-27111
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> Before a Kafka consumer gets assigned partitions, its offset will 
> contain 0 partitions. However, runContinuous will still run and launch a 
> Spark job with 0 partitions. In this case, there is a race where an epoch may 
> interrupt the query execution thread after `lastExecution.toRdd`, and either 
> `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next 
> `runContinuous` will get interrupted unintentionally.
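For context, a minimal sketch (not from the report; the broker address, topic pattern, and checkpoint path are placeholders) of a continuous-trigger Kafka query whose consumer may start with 0 assigned partitions:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("continuous-kafka").getOrCreate()

// Subscribe by pattern so the consumer can start before any matching topic
// (and hence any partition) exists.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribePattern", "topic-.*")
  .load()

val query = df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/continuous-checkpoint")
  .trigger(Trigger.Continuous("1 second"))  // continuous processing mode
  .start()
{code}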



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27112:


Assignee: (was: Apache Spark)

> Spark Scheduler encounters two independent Deadlocks when trying to kill 
> executors either due to dynamic allocation or blacklisting 
> 
>
> Key: SPARK-27112
> URL: https://issues.apache.org/jira/browse/SPARK-27112
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Parth Gandhi
>Priority: Major
> Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 
> 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, 
> Screen Shot 2019-02-26 at 4.11.26 PM.png
>
>
> Recently, a few Spark users in the organization reported that their jobs 
> were getting stuck. On further analysis, it was found that there are two 
> independent deadlocks, each of which occurs under different circumstances. 
> Screenshots of the two deadlocks are attached here. 
> We were able to reproduce the deadlocks with the following piece of code:
>  
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark._
> import org.apache.spark.TaskContext
> // Simple example of Word Count in Scala
> object ScalaWordCount {
> def main(args: Array[String]) {
> if (args.length < 2) {
> System.err.println("Usage: ScalaWordCount  ")
> System.exit(1)
> }
> val conf = new SparkConf().setAppName("Scala Word Count")
> val sc = new SparkContext(conf)
> // get the input file uri
> val inputFilesUri = args(0)
> // get the output file uri
> val outputFilesUri = args(1)
> while (true) {
> val textFile = sc.textFile(inputFilesUri)
> val counts = textFile.flatMap(line => line.split(" "))
> .map(word => {if (TaskContext.get.partitionId == 5 && 
> TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
> blacklisting") else (word, 1)})
> .reduceByKey(_ + _)
> counts.saveAsTextFile(outputFilesUri)
> val conf: Configuration = new Configuration()
> val path: Path = new Path(outputFilesUri)
> val hdfs: FileSystem = FileSystem.get(conf)
> hdfs.delete(path, true)
> }
> sc.stop()
> }
> }
> {code}
>  
> Additionally, to ensure that the deadlock surfaces soon enough, I also 
> added a small delay in the Spark code here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]
>  
> {code:java}
> executorIdToFailureList.remove(exec)
> updateNextExpiryTime()
> Thread.sleep(2000)
> killBlacklistedExecutor(exec)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily has 0 partitions

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27111:


Assignee: Shixiong Zhu  (was: Apache Spark)

> A continuous query may fail with InterruptedException when the Kafka 
> consumer temporarily has 0 partitions
> 
>
> Key: SPARK-27111
> URL: https://issues.apache.org/jira/browse/SPARK-27111
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Before a Kafka consumer gets assigned partitions, its offset will 
> contain 0 partitions. However, runContinuous will still run and launch a 
> Spark job with 0 partitions. In this case, there is a race where an epoch may 
> interrupt the query execution thread after `lastExecution.toRdd`, and either 
> `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next 
> `runContinuous` will get interrupted unintentionally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-08 Thread Parth Gandhi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Gandhi updated SPARK-27112:
-
Attachment: Screen Shot 2019-02-26 at 4.10.48 PM.png

> Spark Scheduler encounters two independent Deadlocks when trying to kill 
> executors either due to dynamic allocation or blacklisting 
> 
>
> Key: SPARK-27112
> URL: https://issues.apache.org/jira/browse/SPARK-27112
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Parth Gandhi
>Priority: Major
> Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 
> 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, 
> Screen Shot 2019-02-26 at 4.11.26 PM.png
>
>
> Recently, a few Spark users in the organization reported that their jobs 
> were getting stuck. On further analysis, it was found that there are two 
> independent deadlocks, each of which occurs under different circumstances. 
> Screenshots of the two deadlocks are attached here. 
> We were able to reproduce the deadlocks with the following piece of code:
>  
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark._
> import org.apache.spark.TaskContext
> // Simple example of Word Count in Scala
> object ScalaWordCount {
> def main(args: Array[String]) {
> if (args.length < 2) {
> System.err.println("Usage: ScalaWordCount  ")
> System.exit(1)
> }
> val conf = new SparkConf().setAppName("Scala Word Count")
> val sc = new SparkContext(conf)
> // get the input file uri
> val inputFilesUri = args(0)
> // get the output file uri
> val outputFilesUri = args(1)
> while (true) {
> val textFile = sc.textFile(inputFilesUri)
> val counts = textFile.flatMap(line => line.split(" "))
> .map(word => {if (TaskContext.get.partitionId == 5 && 
> TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
> blacklisting") else (word, 1)})
> .reduceByKey(_ + _)
> counts.saveAsTextFile(outputFilesUri)
> val conf: Configuration = new Configuration()
> val path: Path = new Path(outputFilesUri)
> val hdfs: FileSystem = FileSystem.get(conf)
> hdfs.delete(path, true)
> }
> sc.stop()
> }
> }
> {code}
>  
> Additionally, to ensure that the deadlock surfaces soon enough, I also 
> added a small delay in the Spark code here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]
>  
> {code:java}
> executorIdToFailureList.remove(exec)
> updateNextExpiryTime()
> Thread.sleep(2000)
> killBlacklistedExecutor(exec)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-08 Thread Parth Gandhi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Gandhi updated SPARK-27112:
-
Attachment: Screen Shot 2019-02-26 at 4.11.26 PM.png

> Spark Scheduler encounters two independent Deadlocks when trying to kill 
> executors either due to dynamic allocation or blacklisting 
> 
>
> Key: SPARK-27112
> URL: https://issues.apache.org/jira/browse/SPARK-27112
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Parth Gandhi
>Priority: Major
> Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 
> 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, 
> Screen Shot 2019-02-26 at 4.11.26 PM.png
>
>
> Recently, a few Spark users in the organization reported that their jobs 
> were getting stuck. On further analysis, it was found that there are two 
> independent deadlocks, each of which occurs under different circumstances. 
> Screenshots of the two deadlocks are attached here. 
> We were able to reproduce the deadlocks with the following piece of code:
>  
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark._
> import org.apache.spark.TaskContext
> // Simple example of Word Count in Scala
> object ScalaWordCount {
> def main(args: Array[String]) {
> if (args.length < 2) {
> System.err.println("Usage: ScalaWordCount  ")
> System.exit(1)
> }
> val conf = new SparkConf().setAppName("Scala Word Count")
> val sc = new SparkContext(conf)
> // get the input file uri
> val inputFilesUri = args(0)
> // get the output file uri
> val outputFilesUri = args(1)
> while (true) {
> val textFile = sc.textFile(inputFilesUri)
> val counts = textFile.flatMap(line => line.split(" "))
> .map(word => {if (TaskContext.get.partitionId == 5 && 
> TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
> blacklisting") else (word, 1)})
> .reduceByKey(_ + _)
> counts.saveAsTextFile(outputFilesUri)
> val conf: Configuration = new Configuration()
> val path: Path = new Path(outputFilesUri)
> val hdfs: FileSystem = FileSystem.get(conf)
> hdfs.delete(path, true)
> }
> sc.stop()
> }
> }
> {code}
>  
> Additionally, to ensure that the deadlock surfaces soon enough, I also 
> added a small delay in the Spark code here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]
>  
> {code:java}
> executorIdToFailureList.remove(exec)
> updateNextExpiryTime()
> Thread.sleep(2000)
> killBlacklistedExecutor(exec)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-08 Thread Parth Gandhi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Gandhi updated SPARK-27112:
-
Attachment: Screen Shot 2019-02-26 at 4.11.11 PM.png

> Spark Scheduler encounters two independent Deadlocks when trying to kill 
> executors either due to dynamic allocation or blacklisting 
> 
>
> Key: SPARK-27112
> URL: https://issues.apache.org/jira/browse/SPARK-27112
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Parth Gandhi
>Priority: Major
> Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 
> 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, 
> Screen Shot 2019-02-26 at 4.11.26 PM.png
>
>
> Recently, a few Spark users in the organization reported that their jobs 
> were getting stuck. On further analysis, it was found that there are two 
> independent deadlocks, each of which occurs under different circumstances. 
> Screenshots of the two deadlocks are attached here. 
> We were able to reproduce the deadlocks with the following piece of code:
>  
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark._
> import org.apache.spark.TaskContext
> // Simple example of Word Count in Scala
> object ScalaWordCount {
> def main(args: Array[String]) {
> if (args.length < 2) {
> System.err.println("Usage: ScalaWordCount  ")
> System.exit(1)
> }
> val conf = new SparkConf().setAppName("Scala Word Count")
> val sc = new SparkContext(conf)
> // get the input file uri
> val inputFilesUri = args(0)
> // get the output file uri
> val outputFilesUri = args(1)
> while (true) {
> val textFile = sc.textFile(inputFilesUri)
> val counts = textFile.flatMap(line => line.split(" "))
> .map(word => {if (TaskContext.get.partitionId == 5 && 
> TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
> blacklisting") else (word, 1)})
> .reduceByKey(_ + _)
> counts.saveAsTextFile(outputFilesUri)
> val conf: Configuration = new Configuration()
> val path: Path = new Path(outputFilesUri)
> val hdfs: FileSystem = FileSystem.get(conf)
> hdfs.delete(path, true)
> }
> sc.stop()
> }
> }
> {code}
>  
> Additionally, to ensure that the deadlock surfaces soon enough, I also 
> added a small delay in the Spark code here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]
>  
> {code:java}
> executorIdToFailureList.remove(exec)
> updateNextExpiryTime()
> Thread.sleep(2000)
> killBlacklistedExecutor(exec)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-08 Thread Parth Gandhi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Gandhi updated SPARK-27112:
-
Attachment: Screen Shot 2019-02-26 at 4.10.26 PM.png

> Spark Scheduler encounters two independent Deadlocks when trying to kill 
> executors either due to dynamic allocation or blacklisting 
> 
>
> Key: SPARK-27112
> URL: https://issues.apache.org/jira/browse/SPARK-27112
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Parth Gandhi
>Priority: Major
> Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 
> 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, 
> Screen Shot 2019-02-26 at 4.11.26 PM.png
>
>
> Recently, a few Spark users in the organization reported that their jobs 
> were getting stuck. On further analysis, it was found that there are two 
> independent deadlocks, each of which occurs under different circumstances. 
> Screenshots of the two deadlocks are attached here. 
> We were able to reproduce the deadlocks with the following piece of code:
>  
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark._
> import org.apache.spark.TaskContext
> // Simple example of Word Count in Scala
> object ScalaWordCount {
> def main(args: Array[String]) {
> if (args.length < 2) {
> System.err.println("Usage: ScalaWordCount  ")
> System.exit(1)
> }
> val conf = new SparkConf().setAppName("Scala Word Count")
> val sc = new SparkContext(conf)
> // get the input file uri
> val inputFilesUri = args(0)
> // get the output file uri
> val outputFilesUri = args(1)
> while (true) {
> val textFile = sc.textFile(inputFilesUri)
> val counts = textFile.flatMap(line => line.split(" "))
> .map(word => {if (TaskContext.get.partitionId == 5 && 
> TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
> blacklisting") else (word, 1)})
> .reduceByKey(_ + _)
> counts.saveAsTextFile(outputFilesUri)
> val conf: Configuration = new Configuration()
> val path: Path = new Path(outputFilesUri)
> val hdfs: FileSystem = FileSystem.get(conf)
> hdfs.delete(path, true)
> }
> sc.stop()
> }
> }
> {code}
>  
> Additionally, to ensure that the deadlock surfaces soon enough, I also 
> added a small delay in the Spark code here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]
>  
> {code:java}
> executorIdToFailureList.remove(exec)
> updateNextExpiryTime()
> Thread.sleep(2000)
> killBlacklistedExecutor(exec)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-08 Thread Parth Gandhi (JIRA)
Parth Gandhi created SPARK-27112:


 Summary: Spark Scheduler encounters two independent Deadlocks when 
trying to kill executors either due to dynamic allocation or blacklisting 
 Key: SPARK-27112
 URL: https://issues.apache.org/jira/browse/SPARK-27112
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 2.4.0, 3.0.0
Reporter: Parth Gandhi


Recently, a few Spark users in the organization reported that their jobs 
were getting stuck. On further analysis, it was found that there are two 
independent deadlocks, each of which occurs under different circumstances. 
Screenshots of the two deadlocks are attached here. 

We were able to reproduce the deadlocks with the following piece of code:

 
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

import org.apache.spark._
import org.apache.spark.TaskContext

// Simple example of Word Count in Scala
object ScalaWordCount {
def main(args: Array[String]) {

if (args.length < 2) {
System.err.println("Usage: ScalaWordCount  ")
System.exit(1)
}

val conf = new SparkConf().setAppName("Scala Word Count")
val sc = new SparkContext(conf)

// get the input file uri
val inputFilesUri = args(0)

// get the output file uri
val outputFilesUri = args(1)

while (true) {
val textFile = sc.textFile(inputFilesUri)
val counts = textFile.flatMap(line => line.split(" "))
.map(word => {if (TaskContext.get.partitionId == 5 && 
TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
blacklisting") else (word, 1)})
.reduceByKey(_ + _)
counts.saveAsTextFile(outputFilesUri)
val conf: Configuration = new Configuration()
val path: Path = new Path(outputFilesUri)
val hdfs: FileSystem = FileSystem.get(conf)
hdfs.delete(path, true)
}

sc.stop()
}
}
{code}
 

Additionally, to ensure that the deadlock surfaces soon enough, I also added 
a small delay in the Spark code here:

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]

 
{code:java}
executorIdToFailureList.remove(exec)
updateNextExpiryTime()
Thread.sleep(2000)
killBlacklistedExecutor(exec)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27111) A continuous query may fail with InterruptedException when the Kafka consumer temporarily has 0 partitions

2019-03-08 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27111:


 Summary: A continuous query may fail with InterruptedException 
when the Kafka consumer temporarily has 0 partitions
 Key: SPARK-27111
 URL: https://issues.apache.org/jira/browse/SPARK-27111
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.3, 2.3.2, 2.3.1, 2.3.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Before a Kafka consumer gets assigned partitions, its offset will contain 
0 partitions. However, runContinuous will still run and launch a Spark job 
with 0 partitions. In this case, there is a race where an epoch may interrupt the 
query execution thread after `lastExecution.toRdd`, and either 
`epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next 
`runContinuous` will get interrupted unintentionally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27004) Code for https uri authentication in Spark Submit needs to be removed

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27004:


Assignee: (was: Apache Spark)

> Code for https uri authentication in Spark Submit needs to be removed
> -
>
> Key: SPARK-27004
> URL: https://issues.apache.org/jira/browse/SPARK-27004
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> The old code in Spark Submit used for URI verification, according to the 
> comments 
> [here|https://github.com/apache/spark/pull/23546#issuecomment-463340476] and 
> [here|https://github.com/apache/spark/pull/23546#issuecomment-463366075], 
> needs to be removed or refactored; otherwise it will cause failures with 
> secure HTTP URIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27004) Code for https uri authentication in Spark Submit needs to be removed

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27004:


Assignee: Apache Spark

> Code for https uri authentication in Spark Submit needs to be removed
> -
>
> Key: SPARK-27004
> URL: https://issues.apache.org/jira/browse/SPARK-27004
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Apache Spark
>Priority: Minor
>
> The old code in Spark Submit used for URI verification, according to the 
> comments 
> [here|https://github.com/apache/spark/pull/23546#issuecomment-463340476] and 
> [here|https://github.com/apache/spark/pull/23546#issuecomment-463366075], 
> needs to be removed or refactored; otherwise it will cause failures with 
> secure HTTP URIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27044) Maven dependency resolution does not support classifiers

2019-03-08 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27044.

Resolution: Duplicate

> Maven dependency resolution does not support classifiers
> 
>
> Key: SPARK-27044
> URL: https://issues.apache.org/jira/browse/SPARK-27044
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mathias Herberts
>Priority: Major
>
> The spark-submit --packages option allows dependencies to be specified using 
> the Maven syntax group:artifact:version, but it does not support specifying a 
> classifier as group:artifact:version:classifier.
> This makes it impossible to fetch 'uber' versions of packages, for example, 
> as they are usually specified using a classifier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code

2019-03-08 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788369#comment-16788369
 ] 

Imran Rashid commented on SPARK-27097:
--

I'm kind of amazed Spark works at all on different platforms.  As you note, 
endianness probably cannot be different.  What kind of platform difference 
results in this issue?  Is it different versions of the JVM?  I'd also be 
amazed if that worked properly.

I'm not saying we shouldn't fix this if it's easy, but maybe we should clarify 
how different the "platform" can be between containers in a Spark app?

> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code
> --
>
> Key: SPARK-27097
> URL: https://issues.apache.org/jira/browse/SPARK-27097
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Kris Mok
>Priority: Critical
>  Labels: correctness
>
> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code.
> Spark SQL performs whole-stage code generation to speed up query execution. 
> There are two steps to it:
> Java source code is generated from the physical query plan on the driver. A 
> single version of the source code is generated from a query plan, and sent to 
> all executors.
> It's compiled to bytecode on the driver to catch compilation errors before 
> sending to executors, but currently only the generated source code gets sent 
> to the executors. The bytecode compilation is for fail-fast only.
> Executors receive the generated source code and compile to bytecode, then the 
> query runs like a hand-written Java program.
> In this model, there's an implicit assumption about the driver and executors 
> being run on similar platforms. Some code paths accidentally embedded 
> platform-dependent object layout information into the generated code, such as:
> {code:java}
> Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
> {code}
> This code expects a field to be at offset +24 of the buffer object, and sets 
> a value to that field.
> But whole-stage code generation generally uses platform-dependent information 
> from the driver. If the object layout is significantly different on the 
> driver and executors, the generated code can be reading/writing to wrong 
> offsets on the executors, causing all kinds of data corruption.
> One code pattern that leads to such a problem is the use of Platform.XXX 
> constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET.
> Bad:
> {code:java}
> val baseOffset = Platform.BYTE_ARRAY_OFFSET
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into 
> the generated code.
> Good:
> {code:java}
> val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will generate the offset symbolically -- Platform.putLong(buffer, 
> Platform.BYTE_ARRAY_OFFSET, value), which will pick up the correct 
> value on the executors.
> Caveat: these offset constants are declared as runtime-initialized static 
> final in Java, so they're not compile-time constants from the Java language's 
> perspective. It does lead to a slightly increased size of the generated code, 
> but this is necessary for correctness.
> NOTE: there can be other patterns that generate platform-dependent code on 
> the driver which is invalid on the executors. e.g. if the endianness is 
> different between the driver and the executors, and if some generated code 
> makes strong assumption about endianness, it would also be problematic.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse

2019-03-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788354#comment-16788354
 ] 

Apache Spark commented on SPARK-27110:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22204

> Moves some functions from AnalyzeColumnCommand to command/CommandUtils for 
> reuse
> 
>
> Key: SPARK-27110
> URL: https://issues.apache.org/jira/browse/SPARK-27110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> To reuse some common logic for improving {{Analyze}} commands (see the 
> description of {{SPARK-25196}} for details), this ticket aims to move some 
> functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse

2019-03-08 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-27110:
-
Description: To reuse some common logics for improving {{Analyze}} commands 
(See the description of {{SPARK-25196}}for details), this ticket targets to 
move some functions from AnalyzeColumnCommand}} to {{command/CommandUtils}}.   
(was: To reuse some common logics for improving {{Analyze}} commands (See the 
description of {{SPARK-25196}}for details), this ticket targets to move some 
functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. )

> Moves some functions from AnalyzeColumnCommand to command/CommandUtils for 
> reuse
> 
>
> Key: SPARK-27110
> URL: https://issues.apache.org/jira/browse/SPARK-27110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> To reuse some common logic for improving {{Analyze}} commands (see the 
> description of {{SPARK-25196}} for details), this ticket aims to move some 
> functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27110:


Assignee: (was: Apache Spark)

> Moves some functions from AnalyzeColumnCommand to command/CommandUtils for 
> reuse
> 
>
> Key: SPARK-27110
> URL: https://issues.apache.org/jira/browse/SPARK-27110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> To reuse some common logic for improving {{Analyze}} commands (see the 
> description of {{SPARK-25196}} for details), this ticket aims to move some 
> functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27110:


Assignee: Apache Spark

> Moves some functions from AnalyzeColumnCommand to command/CommandUtils for 
> reuse
> 
>
> Key: SPARK-27110
> URL: https://issues.apache.org/jira/browse/SPARK-27110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> To reuse some common logic for improving {{Analyze}} commands (see the 
> description of {{SPARK-25196}} for details), this ticket aims to move some 
> functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27110) Moves some functions from AnalyzeColumnCommand to command/CommandUtils for reuse

2019-03-08 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-27110:


 Summary: Moves some functions from AnalyzeColumnCommand to 
command/CommandUtils for reuse
 Key: SPARK-27110
 URL: https://issues.apache.org/jira/browse/SPARK-27110
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


To reuse some common logic for improving {{Analyze}} commands (see the 
description of {{SPARK-25196}} for details), this ticket aims to move some 
functions from {{AnalyzeColumnCommand}} to {{command/CommandUtils}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27109) Refactoring of TimestampFormatter and DateFormatter

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27109:


Assignee: (was: Apache Spark)

> Refactoring of TimestampFormatter and DateFormatter
> ---
>
> Key: SPARK-27109
> URL: https://issues.apache.org/jira/browse/SPARK-27109
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> * Date/TimestampFormatter converts parsed input to Instant before converting 
> it to days/micros. This is an unnecessary conversion because the seconds and 
> fraction of a second can be extracted (calculated) from ZonedDateTime 
> directly (see the sketch below)
>  * Avoid the additional extraction of TemporalQueries.localTime from 
> temporalAccessor
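For illustration, a minimal sketch (not from the ticket; the pattern and input value are arbitrary) of deriving epoch micros from a parsed ZonedDateTime directly, without the intermediate Instant:

{code:java}
import java.time.{ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

// Parse straight to ZonedDateTime; epoch seconds and nanos are available on it directly.
val formatter = DateTimeFormatter
  .ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
  .withZone(ZoneId.of("UTC"))

val zdt = ZonedDateTime.parse("2019-03-08 12:34:56.123456", formatter)
val micros = zdt.toEpochSecond * 1000000L + zdt.getNano / 1000L
{code}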



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27097:


Assignee: Apache Spark  (was: Kris Mok)

> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code
> --
>
> Key: SPARK-27097
> URL: https://issues.apache.org/jira/browse/SPARK-27097
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>
> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code.
> Spark SQL performs whole-stage code generation to speed up query execution. 
> There are two steps to it:
> Java source code is generated from the physical query plan on the driver. A 
> single version of the source code is generated from a query plan, and sent to 
> all executors.
> It's compiled to bytecode on the driver to catch compilation errors before 
> sending to executors, but currently only the generated source code gets sent 
> to the executors. The bytecode compilation is for fail-fast only.
> Executors receive the generated source code and compile to bytecode, then the 
> query runs like a hand-written Java program.
> In this model, there's an implicit assumption about the driver and executors 
> being run on similar platforms. Some code paths accidentally embedded 
> platform-dependent object layout information into the generated code, such as:
> {code:java}
> Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
> {code}
> This code expects a field to be at offset +24 of the buffer object, and sets 
> a value to that field.
> But whole-stage code generation generally uses platform-dependent information 
> from the driver. If the object layout is significantly different on the 
> driver and executors, the generated code can be reading/writing to wrong 
> offsets on the executors, causing all kinds of data corruption.
> One code pattern that leads to such a problem is the use of Platform.XXX 
> constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET.
> Bad:
> {code:java}
> val baseOffset = Platform.BYTE_ARRAY_OFFSET
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into 
> the generated code.
> Good:
> {code:java}
> val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will generate the offset symbolically -- Platform.putLong(buffer, 
> Platform.BYTE_ARRAY_OFFSET, value), which will pick up the correct 
> value on the executors.
> Caveat: these offset constants are declared as runtime-initialized static 
> final in Java, so they're not compile-time constants from the Java language's 
> perspective. It does lead to a slightly increased size of the generated code, 
> but this is necessary for correctness.
> NOTE: there can be other patterns that generate platform-dependent code on 
> the driver which is invalid on the executors. e.g. if the endianness is 
> different between the driver and the executors, and if some generated code 
> makes strong assumption about endianness, it would also be problematic.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27097) Avoid embedding platform-dependent offsets literally in whole-stage generated code

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27097:


Assignee: Kris Mok  (was: Apache Spark)

> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code
> --
>
> Key: SPARK-27097
> URL: https://issues.apache.org/jira/browse/SPARK-27097
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Kris Mok
>Priority: Critical
>  Labels: correctness
>
> Avoid embedding platform-dependent offsets literally in whole-stage generated 
> code.
> Spark SQL performs whole-stage code generation to speed up query execution. 
> There are two steps to it:
> Java source code is generated from the physical query plan on the driver. A 
> single version of the source code is generated from a query plan, and sent to 
> all executors.
> It's compiled to bytecode on the driver to catch compilation errors before 
> sending to executors, but currently only the generated source code gets sent 
> to the executors. The bytecode compilation is for fail-fast only.
> Executors receive the generated source code and compile to bytecode, then the 
> query runs like a hand-written Java program.
> In this model, there's an implicit assumption about the driver and executors 
> being run on similar platforms. Some code paths accidentally embedded 
> platform-dependent object layout information into the generated code, such as:
> {code:java}
> Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
> {code}
> This code expects a field to be at offset +24 of the buffer object, and sets 
> a value to that field.
> But whole-stage code generation generally uses platform-dependent information 
> from the driver. If the object layout is significantly different on the 
> driver and executors, the generated code can be reading/writing to wrong 
> offsets on the executors, causing all kinds of data corruption.
> One code pattern that leads to such problem is the use of Platform.XXX 
> constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET.
> Bad:
> {code:java}
> val baseOffset = Platform.BYTE_ARRAY_OFFSET
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will embed the value of Platform.BYTE_ARRAY_OFFSET on the driver into 
> the generated code.
> Good:
> {code:java}
> val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This will generate the offset symbolically -- Platform.putLong(buffer, 
> Platform.BYTE_ARRAY_OFFSET, value), which will be able to pick up the correct 
> value on the executors.
> Caveat: these offset constants are declared as runtime-initialized static 
> final in Java, so they're not compile-time constants from the Java language's 
> perspective. It does lead to a slightly increased size of the generated code, 
> but this is necessary for correctness.
> NOTE: there can be other patterns that generate platform-dependent code on 
> the driver which is invalid on the executors. e.g. if the endianness is 
> different between the driver and the executors, and if some generated code 
> makes strong assumption about endianness, it would also be problematic.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27109) Refactoring of TimestampFormatter and DateFormatter

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27109:


Assignee: Apache Spark

> Refactoring of TimestampFormatter and DateFormatter
> ---
>
> Key: SPARK-27109
> URL: https://issues.apache.org/jira/browse/SPARK-27109
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> * Date/TimestampFormatter converts parsed input to an Instant before converting 
> it to days/micros. This is an unnecessary conversion because the seconds and the 
> fraction of a second can be extracted (calculated) from ZonedDateTime directly
>  * Avoid the additional extraction of TemporalQueries.localTime from 
> temporalAccessor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27109) Refactoring of TimestampFormatter and DateFormatter

2019-03-08 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-27109:
--

 Summary: Refactoring of TimestampFormatter and DateFormatter
 Key: SPARK-27109
 URL: https://issues.apache.org/jira/browse/SPARK-27109
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


* Date/TimestampFormatter converts parsed input to an Instant before converting it 
to days/micros. This is an unnecessary conversion because the seconds and the 
fraction of a second can be extracted (calculated) from ZonedDateTime directly 
(see the sketch below)
 * Avoid the additional extraction of TemporalQueries.localTime from 
temporalAccessor
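
For illustration, a minimal sketch of the shortcut described above, using plain 
java.time types; the method below is illustrative and not the actual Spark-internal 
API.

{code:scala}
import java.time.ZonedDateTime
import java.time.temporal.ChronoField

// Derive epoch microseconds straight from the parsed ZonedDateTime,
// without going through an intermediate Instant.
def zonedDateTimeToMicros(zdt: ZonedDateTime): Long = {
  val epochSeconds = zdt.toEpochSecond                      // whole seconds since the epoch
  val microsOfSecond = zdt.get(ChronoField.MICRO_OF_SECOND) // fraction of the current second
  epochSeconds * 1000000L + microsOfSecond
}
{code}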



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8547) xgboost exploration

2019-03-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8547.
--
Resolution: Won't Fix

I think the resolution is "xgboost4j-scala"

> xgboost exploration
> ---
>
> Key: SPARK-8547
> URL: https://issues.apache.org/jira/browse/SPARK-8547
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>
> There has been quite a bit of excitement around xgboost: 
> [https://github.com/dmlc/xgboost]
> It improves the parallelism of boosting by mixing boosting and bagging (where 
> bagging makes the algorithm more parallel).
> It would be worth exploring implementing this within MLlib (probably as a new 
> algorithm).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9610) Class and instance weighting for ML

2019-03-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9610.
--
Resolution: Done

Except for GBTs, this is done, so I'm going to close the umbrella

> Class and instance weighting for ML
> ---
>
> Key: SPARK-9610
> URL: https://issues.apache.org/jira/browse/SPARK-9610
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This umbrella is for tracking tasks for adding support for label or instance 
> weights to ML algorithms.  These additions will help handle skewed or 
> imbalanced data, ensemble methods, etc.
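
As a rough illustration of what instance weighting looks like from the user side 
(not a claim about any specific task in this umbrella): estimators that already 
support it take a weight column, as in the sketch below. The column names are 
assumptions.

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression

// Instance (or class) weights are passed via a weight column on the training
// DataFrame; "features", "label" and "weight" are illustrative column names.
val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setWeightCol("weight")   // rows with larger weights contribute more to the loss

// val model = lr.fit(trainingDf)   // trainingDf is an assumed, pre-built DataFrame
{code}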



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9478) Add sample weights to Random Forest

2019-03-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9478.
--
Resolution: Duplicate

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>Priority: Major
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14599) BaggedPoint should support weighted instances.

2019-03-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14599.
---
Resolution: Duplicate

Weighted points were added with SPARK-19591

> BaggedPoint should support weighted instances.
> --
>
> Key: SPARK-14599
> URL: https://issues.apache.org/jira/browse/SPARK-14599
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Major
>
> This JIRA addresses a TODO in bagged point to support individual sample 
> weights. This is a blocker for 
> [SPARK-9478|https://issues.apache.org/jira/browse/SPARK-9478]. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27108) Add parsed CreateTable plans to Catalyst

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27108:


Assignee: (was: Apache Spark)

> Add parsed CreateTable plans to Catalyst
> 
>
> Key: SPARK-27108
> URL: https://issues.apache.org/jira/browse/SPARK-27108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Ryan Blue
>Priority: Major
>
> The abstract Catalyst SQL AST builder cannot currently parse {{CREATE TABLE}} 
> commands. Creates are handled only by {{SparkSqlParser}} because the logical 
> plans are defined in the v1 datasource package 
> (org.apache.spark.sql.execution.datasources).
> The {{SparkSqlParser}} mixes parsing with logic that is specific to v1, like 
> converting {{IF NOT EXISTS}} into a {{SaveMode}}. This makes it difficult 
> (and error-prone) to produce v2 plans because it requires converting the AST 
> to v1 and then converting v1 to v2.
> Instead, the catalyst parser should create plans that represent exactly what 
> was parsed, after validation like ensuring no duplicate clauses. Then those 
> plans should be converted to v1 or v2 plans in the analyzer. This structure 
> will avoid errors caused by multiple layers of translation and keeps v1 and 
> v2 plans separate to ensure that v1 has no behavior changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27108) Add parsed CreateTable plans to Catalyst

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27108:


Assignee: Apache Spark

> Add parsed CreateTable plans to Catalyst
> 
>
> Key: SPARK-27108
> URL: https://issues.apache.org/jira/browse/SPARK-27108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> The abstract Catalyst SQL AST builder cannot currently parse {{CREATE TABLE}} 
> commands. Creates are handled only by {{SparkSqlParser}} because the logical 
> plans are defined in the v1 datasource package 
> (org.apache.spark.sql.execution.datasources).
> The {{SparkSqlParser}} mixes parsing with logic that is specific to v1, like 
> converting {{IF NOT EXISTS}} into a {{SaveMode}}. This makes it difficult 
> (and error-prone) to produce v2 plans because it requires converting the AST 
> to v1 and then converting v1 to v2.
> Instead, the catalyst parser should create plans that represent exactly what 
> was parsed, after validation like ensuring no duplicate clauses. Then those 
> plans should be converted to v1 or v2 plans in the analyzer. This structure 
> will avoid errors caused by multiple layers of translation and keeps v1 and 
> v2 plans separate to ensure that v1 has no behavior changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2019-03-08 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788206#comment-16788206
 ] 

Eyal Farago commented on SPARK-17556:
-

why was this abandoned?

[~viirya]'s pull request seems promising.

I think the last comment by [~LI,Xiao] applies to the current implementation as 
well, since executors hold the entire broadcast anyway (assuming they ran a task 
that used it), so the memory footprint on the executor side doesn't change. 
Regarding the performance regression in the case of many smaller partitions, this 
also applies to the current implementation, as the RDD partitions have to be 
calculated and transferred to the driver.

One thing I personally think could be improved in [~viirya]'s PR is the 
requirement for the RDD to be pre-persisted; blocks could instead be evaluated in 
the mapPartition operation performed in the newly introduced RDD.broadcast method, 
which would have addressed most of the comments by [~holdenk_amp] in the PR.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>Priority: Major
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.
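
For context, a minimal sketch of the driver-side pattern the issue wants to avoid; 
the tiny DataFrame here is just a stand-in for the small join side.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Current pattern (simplified): the small side is collected to the driver,
// then shipped back out to every executor as a broadcast variable.
val smallSide = Seq((1, "a"), (2, "b")).toDF("id", "v")
val collectedRows = smallSide.collect()                  // extra driver round trip
val broadcastRows = spark.sparkContext.broadcast(collectedRows)

// The proposal is to assemble the broadcast from executor-held blocks instead,
// skipping the collect-to-driver step.
{code}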



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27108) Add parsed CreateTable plans to Catalyst

2019-03-08 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27108:
-

 Summary: Add parsed CreateTable plans to Catalyst
 Key: SPARK-27108
 URL: https://issues.apache.org/jira/browse/SPARK-27108
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: Ryan Blue


The abstract Catalyst SQL AST builder cannot currently parse {{CREATE TABLE}} 
commands. Creates are handled only by {{SparkSqlParser}} because the logical 
plans are defined in the v1 datasource package 
(org.apache.spark.sql.execution.datasources).

The {{SparkSqlParser}} mixes parsing with logic that is specific to v1, like 
converting {{IF NOT EXISTS}} into a {{SaveMode}}. This makes it difficult (and 
error-prone) to produce v2 plans because it requires converting the AST to v1 
and then converting v1 to v2.

Instead, the catalyst parser should create plans that represent exactly what 
was parsed, after validation like ensuring no duplicate clauses. Then those 
plans should be converted to v1 or v2 plans in the analyzer. This structure 
will avoid errors caused by multiple layers of translation and keeps v1 and v2 
plans separate to ensure that v1 has no behavior changes.
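
A hedged sketch of the proposed split; the case class and the rule described in the 
comment below are illustrative, not the actual Catalyst types.

{code:scala}
// Parser output: record exactly what was written, with no v1/v2 interpretation.
case class ParsedCreateTable(
    name: Seq[String],                  // multi-part table identifier, as parsed
    columns: Seq[(String, String)],     // column name -> type string, as written
    provider: Option[String],           // USING clause, if present
    ifNotExists: Boolean)               // kept as a flag, not folded into a SaveMode

// Analyzer side (sketch): a resolution rule pattern-matches ParsedCreateTable and
// rewrites it into either a v1 or a v2 create command based on the resolved catalog.
{code}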



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27103) SparkSql reserved keywords don't list in alphabet order

2019-03-08 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27103.
---
   Resolution: Fixed
 Assignee: SongYadong
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23985

> SparkSql reserved keywords don't list in alphabet order
> ---
>
> Key: SPARK-27103
> URL: https://issues.apache.org/jira/browse/SPARK-27103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: SongYadong
>Assignee: SongYadong
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The current spark-sql reserved keywords are not listed in alphabetical order.
> In the test suite some repeated words need to be removed. It would also be better 
> to add some comments as a reminder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27093) Honor ParseMode in AvroFileFormat

2019-03-08 Thread Tim Cerexhe (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788177#comment-16788177
 ] 

Tim Cerexhe commented on SPARK-27093:
-

Thanks for that idea [~Gengliang.Wang]. It catches some of the failure modes we 
need to protect against.

However, we also have files that do not conform to the requisite schema (or have 
malformed schemas, e.g. with invalid byte sequences in keys, since they come 
from user uploads over faulty networks), and these exceptions aren't currently 
being squashed.

If these failure modes were treated as "corrupt" files then this would 
completely satisfy our needs (though this may be a stretch of the definition).

I've uploaded our internal patch for your reference: 
https://github.com/apache/spark/pull/24027

> Honor ParseMode in AvroFileFormat
> -
>
> Key: SPARK-27093
> URL: https://issues.apache.org/jira/browse/SPARK-27093
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Tim Cerexhe
>Priority: Major
>
> The Avro reader is missing the ability to handle malformed or truncated files 
> like the JSON reader. Currently it throws exceptions when it encounters any 
> bad or truncated record in an Avro file, causing the entire Spark job to fail 
> from a single dodgy file. 
> Ideally the AvroFileFormat would accept a Permissive or DropMalformed 
> ParseMode like Spark's JSON format. This would enable the Avro reader to 
> drop bad records and continue processing the good records rather than abort 
> the entire job. 
> Obviously the default could remain as FailFastMode, which is the current 
> effective behavior, so this wouldn’t break any existing users.
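
A hedged sketch of the parallel being drawn to the JSON reader; the paths are 
illustrative, and the Avro variant at the bottom is the proposed, currently 
unsupported behavior, shown commented out.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-mode-sketch").master("local[*]").getOrCreate()

// The JSON source already honors a parse mode:
val jsonDf = spark.read
  .option("mode", "DROPMALFORMED")   // or PERMISSIVE / FAILFAST
  .json("/data/events.json")         // illustrative path

// Proposed equivalent for Avro (not available as of Spark 2.4):
// val avroDf = spark.read.format("avro")
//   .option("mode", "DROPMALFORMED")
//   .load("/data/events.avro")
{code}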



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27093) Honor ParseMode in AvroFileFormat

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27093:


Assignee: (was: Apache Spark)

> Honor ParseMode in AvroFileFormat
> -
>
> Key: SPARK-27093
> URL: https://issues.apache.org/jira/browse/SPARK-27093
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Tim Cerexhe
>Priority: Major
>
> The Avro reader is missing the ability to handle malformed or truncated files 
> like the JSON reader. Currently it throws exceptions when it encounters any 
> bad or truncated record in an Avro file, causing the entire Spark job to fail 
> from a single dodgy file. 
> Ideally the AvroFileFormat would accept a Permissive or DropMalformed 
> ParseMode like Spark's JSON format. This would enable the Avro reader to 
> drop bad records and continue processing the good records rather than abort 
> the entire job. 
> Obviously the default could remain as FailFastMode, which is the current 
> effective behavior, so this wouldn’t break any existing users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27093) Honor ParseMode in AvroFileFormat

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27093:


Assignee: Apache Spark

> Honor ParseMode in AvroFileFormat
> -
>
> Key: SPARK-27093
> URL: https://issues.apache.org/jira/browse/SPARK-27093
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Tim Cerexhe
>Assignee: Apache Spark
>Priority: Major
>
> The Avro reader is missing the ability to handle malformed or truncated files 
> like the JSON reader. Currently it throws exceptions when it encounters any 
> bad or truncated record in an Avro file, causing the entire Spark job to fail 
> from a single dodgy file. 
> Ideally the AvroFileFormat would accept a Permissive or DropMalformed 
> ParseMode like Spark's JSON format. This would enable the Avro reader to 
> drop bad records and continue processing the good records rather than abort 
> the entire job. 
> Obviously the default could remain as FailFastMode, which is the current 
> effective behavior, so this wouldn’t break any existing users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27079) Fix typo & Remove useless imports

2019-03-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27079:
-

Assignee: EdisonWang

> Fix typo & Remove useless imports
> -
>
> Key: SPARK-27079
> URL: https://issues.apache.org/jira/browse/SPARK-27079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27079) Fix typo & Remove useless imports

2019-03-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27079.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24000
[https://github.com/apache/spark/pull/24000]

> Fix typo & Remove useless imports
> -
>
> Key: SPARK-27079
> URL: https://issues.apache.org/jira/browse/SPARK-27079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Trivial
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27079) Fix typo & Remove useless imports

2019-03-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-27079:
--
Priority: Trivial  (was: Minor)

Yeah [~EdisonWang], this isn't meaningful as a JIRA. If it's truly minor, in the 
sense that describing the minor issue you're addressing takes about the same effort 
as just making the PR, just skip the JIRA.

> Fix typo & Remove useless imports
> -
>
> Key: SPARK-27079
> URL: https://issues.apache.org/jira/browse/SPARK-27079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>")

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27090:


Assignee: (was: Apache Spark)

> Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>")
> --
>
> Key: SPARK-27090
> URL: https://issues.apache.org/jira/browse/SPARK-27090
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For legacy reasons LEGACY_DRIVER_IDENTIFIER was checked in a few places 
> along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver 
> is running or an executor.
> The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4. So 
> I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.
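
A tiny sketch of the double check being removed; the constant values follow the 
issue text ("driver" and "<driver>"), and the helper name is illustrative.

{code:scala}
// Illustrative only: the kind of check the cleanup simplifies.
val DriverIdentifier = "driver"          // DRIVER_IDENTIFIER, introduced in Spark 1.4
val LegacyDriverIdentifier = "<driver>"  // LEGACY_DRIVER_IDENTIFIER, kept for pre-1.4 compatibility

def isDriver(executorId: String): Boolean =
  executorId == DriverIdentifier || executorId == LegacyDriverIdentifier

// After removing the legacy identifier, only the "driver" comparison remains.
{code}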



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>")

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27090:


Assignee: Apache Spark

> Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>")
> --
>
> Key: SPARK-27090
> URL: https://issues.apache.org/jira/browse/SPARK-27090
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> For legacy reasons LEGACY_DRIVER_IDENTIFIER was checked in a few places 
> along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver 
> is running or an executor.
> The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4. So 
> I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Hien Luu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788075#comment-16788075
 ] 

Hien Luu commented on SPARK-27027:
--

Hi [~hyukjin.kwon], yes I did, and here is how I started the spark-shell on my 
Mac.

./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0

My spark version is:

scala> spark.version

res0: String = 2.4.0

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Hien Luu
>Priority: Minor
>
> {{from_avro}} function produces wrong output of a struct field.  See the 
> output at the bottom of the description
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +------------------------+
> |from_avro(value, struct)|
> +------------------------+
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> +------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-08 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788021#comment-16788021
 ] 

Dhruve Ashar commented on SPARK-27107:
--

[~dongjoon] can you review the PR for ORC to fix this issue? 
[https://github.com/apache/orc/pull/372]

Once this is merged, we can fix the issue in Spark as well. Until then the only 
workaround is to use the Hive-based implementation.
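
A minimal sketch of that workaround, assuming a freshly built session; 
spark.sql.orc.impl accepts "native" or "hive".

{code:scala}
import org.apache.spark.sql.SparkSession

// Fall back to the Hive-based ORC implementation until the ORC-side fix is picked up.
val spark = SparkSession.builder()
  .appName("orc-hive-fallback")
  .master("local[*]")
  .config("spark.sql.orc.impl", "hive")
  .getOrCreate()
{code}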

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> 

[jira] [Created] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-08 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-27107:


 Summary: Spark SQL Job failing because of Kryo buffer overflow 
with ORC
 Key: SPARK-27107
 URL: https://issues.apache.org/jira/browse/SPARK-27107
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.2
Reporter: Dhruve Ashar


The issue occurs while trying to read ORC data and setting the SearchArgument.
{code:java}
 Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
Available: 0, required: 9
Serialization trace:
literalList 
(org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
at 
org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
at 
org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
at 
org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
at 
org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)

[jira] [Commented] (SPARK-25928) NoSuchMethodError net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V

2019-03-08 Thread Abhinay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787996#comment-16787996
 ] 

Abhinay commented on SPARK-25928:
-

We are running into the same issue with EMR 5.20.0 with Oozie 5.0.0 and Spark 
2.4.0. The workaround seems to be working for now, but it does require some 
additional setup: either converting to Snappy or deleting the lz4 1.4.0 jar in the 
Oozie share lib.

The feedback from the AWS support team was that it is more of an upstream issue 
with Spark. Can we know when this issue will be resolved?

 

> NoSuchMethodError 
> net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
> -
>
> Key: SPARK-25928
> URL: https://issues.apache.org/jira/browse/SPARK-25928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.1
> Environment: EMR 5.17 which is using oozie 5.0.0 and spark 2.3.1
>Reporter: Jerry Chabot
>Priority: Major
>
> I am not sure if this is an Oozie problem, a Spark problem or a user error. 
> It is blocking our upcoming release.
> We are upgrading from Amazon's EMR 5.7 to EMR 5.17. The version changes are:
>     oozie 4.3.0 -> 5.0.0
>      spark 2.1.1 -> 2.3.1
> All our Oozie/Spark jobs were working in EMR 5.7. After upgrading, some of 
> our jobs which use a spark action are failing with the NoSuchMethodError shown 
> further down in the description. It seems like conflicting classes.
> I noticed the spark share lib directory has two versions of the LZ4 jar.
> sudo -u hdfs hadoop fs -ls /user/oozie/share/lib/lib_20181029182704/spark/*lz*
>  -rw-r--r--   3 oozie oozie  79845 2018-10-29 18:27 
> /user/oozie/share/lib/lib_20181029182704/spark/compress-lzf-1.0.3.jar
>  -rw-r--r--   3 hdfs  oozie 236880 2018-11-01 18:22 
> /user/oozie/share/lib/lib_20181029182704/spark/lz4-1.3.0.jar
>  -rw-r--r--   3 oozie oozie 370119 2018-10-29 18:27 
> /user/oozie/share/lib/lib_20181029182704/spark/lz4-java-1.4.0.jar
> But, both of these jars have the constructor 
> LZ4BlockInputStream(java/io/InputStream). The spark/jars directory has only 
> lz4-java-1.4.0.jar. share lib seems to be getting it from the 
> /usr/lib/oozie/oozie-sharelib.tar.gz.
> Unfortunately, my team member that knows most about Spark is on vacation. 
> Does anyone have any suggestions on how best to troubleshoot this problem?
> Here is the strack trace.
>   diagnostics: User class threw exception: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, ip-172-27-113-49.ec2.internal, executor 2): 
> java.lang.NoSuchMethodError: 
> net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
>     at 
> org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
>     at scala.Option.map(Option.scala:146)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:304)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
>     at scala.Option.getOrElse(Option.scala:121)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
>     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1346)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>     at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Assigned] (SPARK-27106) merge CaseInsensitiveStringMap and DataSourceOptions

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27106:


Assignee: Wenchen Fan  (was: Apache Spark)

> merge CaseInsensitiveStringMap and DataSourceOptions
> 
>
> Key: SPARK-27106
> URL: https://issues.apache.org/jira/browse/SPARK-27106
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25863) java.lang.UnsupportedOperationException: empty.max at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala

2019-03-08 Thread Val Feldsher (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787989#comment-16787989
 ] 

Val Feldsher commented on SPARK-25863:
--

[~maropu] yes it works now, I still see the warning but it doesn't crash. 
Thanks!

> java.lang.UnsupportedOperationException: empty.max at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> -
>
> Key: SPARK-25863
> URL: https://issues.apache.org/jira/browse/SPARK-25863
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: cache, catalyst, code-generation
> Fix For: 2.3.4, 2.4.2, 3.0.0
>
>
> Failing task : 
> {noformat}
> An error occurred while calling o2875.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 
> in stage 21413.0 failed 4 times, most recent failure: Lost task 58.3 in stage 
> 21413.0 (TID 4057314, pc1udatahad117, executor 431): 
> java.lang.UnsupportedOperationException: empty.max
> at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
> at scala.collection.AbstractTraversable.max(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.updateAndGetCompilationStats(CodeGenerator.scala:1475)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1418)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
> at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:81)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:40)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
> at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:263)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$filteredCachedBatches$1.apply(InMemoryTableScanExec.scala:262)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:818)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at 

[jira] [Assigned] (SPARK-27106) merge CaseInsensitiveStringMap and DataSourceOptions

2019-03-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27106:


Assignee: Apache Spark  (was: Wenchen Fan)

> merge CaseInsensitiveStringMap and DataSourceOptions
> 
>
> Key: SPARK-27106
> URL: https://issues.apache.org/jira/browse/SPARK-27106
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27106) merge CaseInsensitiveStringMap and DataSourceOptions

2019-03-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27106:
---

 Summary: merge CaseInsensitiveStringMap and DataSourceOptions
 Key: SPARK-27106
 URL: https://issues.apache.org/jira/browse/SPARK-27106
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables

2019-03-08 Thread Rick Kramer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Kramer resolved SPARK-23012.
-
Resolution: Fixed

> Support for predicate pushdown and partition pruning when left joining large 
> Hive tables
> 
>
> Key: SPARK-23012
> URL: https://issues.apache.org/jira/browse/SPARK-23012
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.2.0
>Reporter: Rick Kramer
>Priority: Major
> Fix For: 2.4.0
>
>
> We have a hive view which left outer joins several large, partitioned orc 
> hive tables together on date. When the view is used in a hive query, hive 
> pushes date predicates down into the joins and prunes the partitions for all 
> tables. When I use this view from pyspark, the predicate is only used to 
> prune the left-most table and all partitions from the additional tables are 
> selected.
> For example, consider two partitioned hive tables a & b joined in a view:
> create table a (
>a_val string
> )
> partitioned by (ds string)
> stored as orc;
> create table b (
>b_val string
> )
> partitioned by (ds string)
> stored as orc;
> create view example_view as
> select
> a_val
> , b_val
> , ds
> from a 
> left outer join b on b.ds = a.ds
> Then in pyspark you might try to query from the view filtering on ds:
> spark.table('example_view').filter(F.col('ds') == '2018-01-01')
> If tables a and b are large, this results in a plan that filters a on ds = 
> 2018-01-01, but scans all partitions of table b.
> If the join in the view is changed to an inner join, the predicate gets 
> pushed down to a & b and the partitions are pruned as you'd expect.
> In practice, the view is fairly complex and contains a lot of business logic 
> we'd prefer not to replicate in pyspark if we can avoid it.
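
A short sketch of how the reported behavior can be checked, assuming an active 
SparkSession named spark and the example_view defined in the description.

{code:scala}
import org.apache.spark.sql.functions.col

// `spark` is an assumed active SparkSession; example_view follows the description.
// With the left outer join, the ds predicate only shows up as a partition filter
// on table a in the physical plan, while table b is scanned in full.
spark.table("example_view")
  .filter(col("ds") === "2018-01-01")
  .explain(true)
{code}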



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables

2019-03-08 Thread Rick Kramer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Kramer updated SPARK-23012:

Fix Version/s: 2.4.0

> Support for predicate pushdown and partition pruning when left joining large 
> Hive tables
> 
>
> Key: SPARK-23012
> URL: https://issues.apache.org/jira/browse/SPARK-23012
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.2.0
>Reporter: Rick Kramer
>Priority: Major
> Fix For: 2.4.0
>
>
> We have a hive view which left outer joins several large, partitioned orc 
> hive tables together on date. When the view is used in a hive query, hive 
> pushes date predicates down into the joins and prunes the partitions for all 
> tables. When I use this view from pyspark, the predicate is only used to 
> prune the left-most table and all partitions from the additional tables are 
> selected.
> For example, consider two partitioned hive tables a & b joined in a view:
> create table a (
>a_val string
> )
> partitioned by (ds string)
> stored as orc;
> create table b (
>b_val string
> )
> partitioned by (ds string)
> stored as orc;
> create view example_view as
> select
> a_val
> , b_val
> , ds
> from a 
> left outer join b on b.ds = a.ds
> Then in pyspark you might try to query from the view filtering on ds:
> spark.table('example_view').filter(F.col('ds') == '2018-01-01')
> If tables a and b are large, this results in a plan that filters a on ds = 
> 2018-01-01, but scans all partitions of table b.
> If the join in the view is changed to an inner join, the predicate gets 
> pushed down to a & b and the partitions are pruned as you'd expect.
> In practice, the view is fairly complex and contains a lot of business logic 
> we'd prefer not to replicate in pyspark if we can avoid it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables

2019-03-08 Thread Rick Kramer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787968#comment-16787968
 ] 

Rick Kramer commented on SPARK-23012:
-

[~Saurabh Santhosh] We worked around it by materializing the hive view as a 
cached table and using that table from Spark.

I did retest this when 2.3.0 was released and I'm pretty sure the issue was 
still there.

Good to know that 2.4 fixes it.

> Support for predicate pushdown and partition pruning when left joining large 
> Hive tables
> 
>
> Key: SPARK-23012
> URL: https://issues.apache.org/jira/browse/SPARK-23012
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.2.0
>Reporter: Rick Kramer
>Priority: Major
>
> We have a hive view which left outer joins several large, partitioned orc 
> hive tables together on date. When the view is used in a hive query, hive 
> pushes date predicates down into the joins and prunes the partitions for all 
> tables. When I use this view from pyspark, the predicate is only used to 
> prune the left-most table and all partitions from the additional tables are 
> selected.
> For example, consider two partitioned hive tables a & b joined in a view:
> create table a (
>a_val string
> )
> partitioned by (ds string)
> stored as orc;
> create table b (
>b_val string
> )
> partitioned by (ds string)
> stored as orc;
> create view example_view as
> select
> a_val
> , b_val
> , ds
> from a 
> left outer join b on b.ds = a.ds
> Then in pyspark you might try to query from the view filtering on ds:
> spark.table('example_view').filter(F.col('ds') == '2018-01-01')
> If tables a and b are large, this results in a plan that filters a on ds = 
> 2018-01-01, but scans all partitions of table b.
> If the join in the view is changed to an inner join, the predicate gets 
> pushed down to a & b and the partitions are pruned as you'd expect.
> In practice, the view is fairly complex and contains a lot of business logic 
> we'd prefer not to replicate in pyspark if we can avoid it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers

2019-03-08 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787834#comment-16787834
 ] 

Gabor Somogyi commented on SPARK-27044:
---

Thanks [~herberts], a rebuilt shaded jar was my main use case as well. 
Proceeding...

> Maven dependency resolution does not support classifiers
> 
>
> Key: SPARK-27044
> URL: https://issues.apache.org/jira/browse/SPARK-27044
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mathias Herberts
>Priority: Major
>
> The spark-submit --packages option allows dependencies to be specified using 
> the maven syntax group:artifact:version, but it does not support specifying a 
> classifier as group:artifact:version:classifier
> This makes it impossible to fetch 'uber' versions of packages, for example, as 
> they are usually specified using a classifier.
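
A hedged sketch in terms of the equivalent spark.jars.packages setting; the 
coordinate is made up, and the commented-out form with a classifier is the 
proposed, currently unsupported syntax.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("packages-sketch")
  .master("local[*]")
  .config("spark.jars.packages", "com.example:mylib:1.0.0")          // supported: group:artifact:version
  // .config("spark.jars.packages", "com.example:mylib:1.0.0:uber")  // proposed: with a classifier
  .getOrCreate()
{code}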



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>")

2019-03-08 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787822#comment-16787822
 ] 

Attila Zsolt Piros commented on SPARK-27090:


I would have waited a little bit to find out what others' opinions about this are. 
But it's fine for me.

> Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>")
> --
>
> Key: SPARK-27090
> URL: https://issues.apache.org/jira/browse/SPARK-27090
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For legacy reasons LEGACY_DRIVER_IDENTIFIER was checked in a few places, 
> along with the new DRIVER_IDENTIFIER ("driver"), to decide whether a driver 
> or an executor is running.
> The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so 
> I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.
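
A hedged sketch of the kind of check the description refers to; the constant 
values and the helper name here are assumptions for illustration, not the 
actual Spark internals.

{code}
// Illustrative only: deciding driver vs. executor from the executor id.
val DRIVER_IDENTIFIER = "driver"          // current identifier (since Spark 1.4)
val LEGACY_DRIVER_IDENTIFIER = "<driver>" // assumed legacy value to be removed

def isDriver(executorId: String): Boolean =
  executorId == DRIVER_IDENTIFIER || executorId == LEGACY_DRIVER_IDENTIFIER
{code}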



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-08 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787815#comment-16787815
 ] 

Ajith S commented on SPARK-27060:
-

Just checked on the master branch; it seems the property doesn't work anymore:

{{Spark master: local[*], Application Id: local-1552045445603}}
{{spark-sql> set spark.sql.parser.ansi.enabled;}}
{{spark.sql.parser.ansi.enabled true}}
{{Time taken: 2.178 seconds, Fetched 1 row(s)}}
{{spark-sql> CREATE TABLE CREATE(a int);}}
{{Time taken: 0.502 seconds}}
{{spark-sql> SHOW TABLES;}}
{{default create false}}
{{Time taken: 0.121 seconds, Fetched 1 row(s)}}
{{spark-sql> }}

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as Hive 
> and MySQL. 
> DDL commands are successful even though the tableName is the same as a keyword. 
> Tested with columnNames as well, and the issue exists. 
> Whereas Hive-Beeline throws a ParseException and does not accept keywords 
> as a tableName or columnName, and MySQL accepts keywords only as a columnName.
> Spark-Behaviour :
> {code}
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> {code}
> Hive-Behaviour :
> {code}
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24252) DataSourceV2: Add catalog support

2019-03-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24252:
---

Assignee: Ryan Blue

> DataSourceV2: Add catalog support
> -
>
> Key: SPARK-24252
> URL: https://issues.apache.org/jira/browse/SPARK-24252
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> DataSourceV2 needs to support create and drop catalog operations in order to 
> support logical plans like CTAS.
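
Purely as a hedged illustration (hypothetical trait and method names, not the 
interface added by this work), this is the kind of catalog operation a 
CTAS-style plan needs from a v2 source:

{code}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch only: catalog operations required by CTAS-like plans.
trait IllustrativeTableCatalog {
  def createTable(name: String, schema: StructType, properties: Map[String, String]): Unit
  def dropTable(name: String): Unit
}
{code}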



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-08 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787797#comment-16787797
 ] 

Ajith S edited comment on SPARK-27060 at 3/8/19 11:20 AM:
--

[~sachin1729] Can setting *spark.sql.parser.ansi.enabled=true* fix this? This 
configuration makes Spark follow the ANSI SQL standard strictly.

If it doesn't, it may be an issue.


was (Author: ajithshetty):
[~sachin1729] Can setting *spark.sql.parser.ansi.enabled=true* fix this? This 
configuration makes Spark follow the ANSI SQL standard strictly.

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as Hive 
> and MySQL. 
> DDL commands are successful even though the tableName is the same as a keyword. 
> Tested with columnNames as well, and the issue exists. 
> Whereas Hive-Beeline throws a ParseException and does not accept keywords 
> as a tableName or columnName, and MySQL accepts keywords only as a columnName.
> Spark-Behaviour :
> {code}
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> {code}
> Hive-Behaviour :
> {code}
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24252) DataSourceV2: Add catalog support

2019-03-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24252.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23915
[https://github.com/apache/spark/pull/23915]

> DataSourceV2: Add catalog support
> -
>
> Key: SPARK-24252
> URL: https://issues.apache.org/jira/browse/SPARK-24252
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> DataSourceV2 needs to support create and drop catalog operations in order to 
> support logical plans like CTAS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-08 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787803#comment-16787803
 ] 

Ajith S commented on SPARK-27060:
-

Please refer SPARK-26215

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as Hive 
> and MySQL. 
> DDL commands are successful even though the tableName is the same as a keyword. 
> Tested with columnNames as well, and the issue exists. 
> Whereas Hive-Beeline throws a ParseException and does not accept keywords 
> as a tableName or columnName, and MySQL accepts keywords only as a columnName.
> Spark-Behaviour :
> {code}
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> {code}
> Hive-Behaviour :
> {code}
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27060) DDL Commands are accepting Keywords like create, drop as tableName

2019-03-08 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787797#comment-16787797
 ] 

Ajith S commented on SPARK-27060:
-

[~sachin1729] Can setting *spark.sql.parser.ansi.enabled=true* fix this? This 
configuration makes Spark follow the ANSI SQL standard strictly.
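
A hedged sketch of trying the suggested setting from a Scala session; note that, 
as tested elsewhere in this thread, the property may not change the behaviour 
on current master.

{code}
// Enable the strict ANSI parser setting and re-run the DDL from the report.
spark.conf.set("spark.sql.parser.ansi.enabled", "true")
spark.sql("CREATE TABLE create(id INT)")  // expected to be rejected under strict ANSI parsing
{code}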

> DDL Commands are accepting Keywords like create, drop as tableName
> --
>
> Key: SPARK-27060
> URL: https://issues.apache.org/jira/browse/SPARK-27060
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sachin Ramachandra Setty
>Priority: Minor
>
> Seems to be a compatibility issue compared to other components such as Hive 
> and MySQL. 
> DDL commands are successful even though the tableName is the same as a keyword. 
> Tested with columnNames as well, and the issue exists. 
> Whereas Hive-Beeline throws a ParseException and does not accept keywords 
> as a tableName or columnName, and MySQL accepts keywords only as a columnName.
> Spark-Behaviour :
> {code}
> Connected to: Spark SQL (version 2.3.2.0101)
> CLI_DBMS_APPID
> Beeline version 1.2.1.spark_2.3.2.0101 by Apache Hive
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table create(id int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.255 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table drop(int int);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.257 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table drop;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.236 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> drop table create;
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.168 seconds)
> 0: jdbc:hive2://10.18.3.XXX:23040/default> create table tab1(float float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.111 seconds)
> 0: jdbc:hive2://10.18.XXX:23040/default> create table double(double float);
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.093 seconds)
> {code}
> Hive-Behaviour :
> {code}
> Connected to: Apache Hive (version 3.1.0)
> Driver: Hive JDBC (version 3.1.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0 by Apache Hive
> 0: jdbc:hive2://10.18.XXX:21066/> create table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> create table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:13 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> create table tab1(float float);
> Error: Error while compiling statement: FAILED: ParseException line 1:18 
> cannot recognize input near 'float' 'float' ')' in column name or constraint 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18XXX:21066/> drop table create(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'create' '(' 'id' in table name 
> (state=42000,code=4)
> 0: jdbc:hive2://10.18.XXX:21066/> drop table drop(id int);
> Error: Error while compiling statement: FAILED: ParseException line 1:11 
> cannot recognize input near 'drop' '(' 'id' in table name 
> (state=42000,code=4)
> mySql :
> CREATE TABLE CREATE(ID integer);
> Error: near "CREATE": syntax error
> CREATE TABLE DROP(ID integer);
> Error: near "DROP": syntax error
> CREATE TABLE TAB1(FLOAT FLOAT);
> Success
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27098) Flaky missing file parts when writing to Ceph without error

2019-03-08 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787792#comment-16787792
 ] 

Steve Loughran commented on SPARK-27098:


It was my suggestion to file this. 

* If this were AWS S3, this would "just" be AWS S3's eventual consistency 
surfacing on renames: the directory listing needed to mimic the rename misses 
the newly committed files, most likely when the write is immediately before the 
rename (v2 task commit, v1 job commit + the straggler tasks), with the 
solutions being the standard ones: use a table in DynamoDB for list consistency, 
or a zero-rename committer which doesn't need consistent listings. (Or: Iceberg)

* But this is Ceph, which is, AFAIK, consistent.

# Who has played with Ceph as the destination store for queries? Through the 
S3A libraries?
# What do people think can be enabled/added to the Spark-level committers to 
detect this problem? The tasks know the files they've actually created and can 
report to the job committer; it could do a post-job-commit audit of the output 
and fail if something is missing.

[~mwlon] is going to be the one trying to debug this. 

Martin:
#  you can get some more logging of what's up in the S3A code by setting the 
log for {{org.apache.hadoop.fs.s3a.S3AFileSystem}} to debug and looking for log 
entries beginning "Rename path". At least I think so, that 2.7.x codebase is 3+ 
years old, frozen for all but security fixes for 12 months, and never going to 
get another release (related to the AWS SDK, ironically). 
# the Hadoop 2.9.x releases do have S3Guard in, and while using a remote DDB 
table to add consistency to a local Ceph store is pretty inefficient, it'd be 
interesting to see whether enabling it would make this problem go away. In 
which case, you've just found a bug in Ceph
# [Ryan's S3 committers|https://github.com/rdblue/s3committer] do work with 
hadoop 2.7.x. Try them


> Flaky missing file parts when writing to Ceph without error
> ---
>
> Key: SPARK-27098
> URL: https://issues.apache.org/jira/browse/SPARK-27098
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233
> Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint. 
> occasionally a file part will be missing; i.e. part 3 here:
> ```
> > aws s3 ls my-bucket/folder/
> 2019-02-28 13:07:21  0 _SUCCESS
> 2019-02-28 13:06:58   79428651 
> part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:06:59   79586172 
> part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:00   79561910 
> part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:01   79192617 
> part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:07   79364413 
> part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:08   79623254 
> part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79445030 
> part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79474923 
> part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:11   79477310 
> part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:12   79331453 
> part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79567600 
> part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79388012 
> part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:14   79308387 
> part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:15   79455483 
> part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:17   79512342 
> part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79403307 
> part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79617769 
> part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:19   79333534 
> part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:20   79543324 
> part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> ```
> However, the write succeeds and leaves a _SUCCESS file.
> This can be caught by additionally checking afterward whether the number of 
> written file parts agrees with the number of partitions, but Spark should at 
> least fail on its own and leave a meaningful stack trace in this case.

[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers

2019-03-08 Thread Mathias Herberts (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787788#comment-16787788
 ] 

Mathias Herberts commented on SPARK-27044:
--

This is exactly what Jungtaek said: the ability to specify packages that have 
a non-default classifier, in order to fetch specific jars such as shaded/uber 
jars, which is the most common use case.

> Maven dependency resolution does not support classifiers
> 
>
> Key: SPARK-27044
> URL: https://issues.apache.org/jira/browse/SPARK-27044
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mathias Herberts
>Priority: Major
>
> The spark-submit --packages option allows dependencies to be specified using 
> the maven syntax group:artifact:version, but it does not support specifying a 
> classifier as group:artifact:version:classifier
> This makes it impossible to fetch 'uber' versions of packages for example as 
> they are usually specified using a classifier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24291) Data source table is not displaying records when files are uploaded to table location

2019-03-08 Thread Sushanta Sen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787786#comment-16787786
 ] 

Sushanta Sen commented on SPARK-24291:
--

Yes, after a refresh it fetches the data. But why is this happening for tables 
created with 'USING'?
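
A hedged sketch of the refresh step mentioned above, from a Scala session, 
using the table name from the quoted reproduction below:

{code}
// Refresh the catalog's cached file listing for the data source table,
// then re-read it to pick up files added directly to the path.
spark.catalog.refreshTable("os_orc")   // or: spark.sql("REFRESH TABLE os_orc")
spark.table("os_orc").show()
{code}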

> Data source table is not displaying records when files are uploaded to table 
> location
> -
>
> Key: SPARK-24291
> URL: https://issues.apache.org/jira/browse/SPARK-24291
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE11
> Spark Version: 2.3
>Reporter: Sushanta Sen
>Priority: Major
>
> Precondition:
> 1.Already one .orc file exists in the /tmp/orcdata/ location
>  # Launch Spark-sql
>  # spark-sql> CREATE TABLE os_orc (name string, version string, other string) 
> USING ORC OPTIONS (path '/tmp/orcdata/');
>  # spark-sql> select * from os_orc;
> Spark 2.3.0 Apache
> Time taken: 2.538 seconds, Fetched 1 row(s)
>  # pc1:/opt/# *./hadoop dfs -ls /tmp/orcdata*
> Found 1 items
> -rw-r--r-- 3 spark hadoop 475 2018-05-09 18:21 
> /tmp/orcdata/part-0-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc
> pc1:/opt/# *./hadoop fs -copyFromLocal 
> /opt/OS/loaddata/orcdata/part-1-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc
>  /tmp/orcdata/data2.orc*
> pc1:/opt/# *./hadoop dfs -ls /tmp/orcdata*
> Found *2* items
> -rw-r--r-- 3 spark hadoop 475 2018-05-15 14:59 /tmp/orcdata/data2.orc
> -rw-r--r-- 3 spark hadoop 475 2018-05-09 18:21 
> /tmp/orcdata/part-0-d488121b-e9fd-4269-a6ea-842c631722ee-c000.snappy.orc
> pc1:/opt/#
>  5. Again execute the select command on the table os_orc
> spark-sql> select * from os_orc;
> Spark 2.3.0 Apache
> Time taken: 1.528 seconds, Fetched {color:#FF}1 row(s){color}
> Actual Result: On executing the select command, it does not display all the 
> records that exist in the data source table location.
> Expected Result: All the records should be fetched and displayed for the data 
> source table from the location
> NB:
> 1. On exiting and relaunching the spark-sql session, the select command fetches 
> the correct number of records.
> 2. This issue is valid for all data source tables created with 'USING'.
> I came across this use case in Spark 2.2.1 when I tried to reproduce a customer 
> site observation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers

2019-03-08 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787784#comment-16787784
 ] 

Gabor Somogyi commented on SPARK-27044:
---

I've already started but wanted to double-check things from a usage perspective.

> Maven dependency resolution does not support classifiers
> 
>
> Key: SPARK-27044
> URL: https://issues.apache.org/jira/browse/SPARK-27044
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mathias Herberts
>Priority: Major
>
> The spark-submit --packages option allows dependencies to be specified using 
> the maven syntax group:artifact:version, but it does not support specifying a 
> classifier as group:artifact:version:classifier
> This makes it impossible to fetch 'uber' versions of packages for example as 
> they are usually specified using a classifier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27098) Flaky missing file parts when writing to Ceph without error

2019-03-08 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-27098:
---
Description: 
https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233

Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint. 
occasionally a file part will be missing; i.e. part 3 here:

```
> aws s3 ls my-bucket/folder/
2019-02-28 13:07:21  0 _SUCCESS
2019-02-28 13:06:58   79428651 
part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:06:59   79586172 
part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:00   79561910 
part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:01   79192617 
part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:07   79364413 
part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:08   79623254 
part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:10   79445030 
part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:10   79474923 
part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:11   79477310 
part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:12   79331453 
part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:13   79567600 
part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:13   79388012 
part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:14   79308387 
part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:15   79455483 
part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:17   79512342 
part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:18   79403307 
part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:18   79617769 
part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:19   79333534 
part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:20   79543324 
part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
```

However, the write succeeds and leaves a _SUCCESS file.

This can be caught by additionally checking afterward whether the number of 
written file parts agrees with the number of partitions, but Spark should at 
least fail on its own and leave a meaningful stack trace in this case.
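
A hedged Scala sketch of the manual post-write audit described above; the 
output path, the DataFrame variable, and the part-file naming check are 
assumptions for illustration, not from the ticket.

{code}
import org.apache.hadoop.fs.Path

// After df.write...save(outputPath), compare the part files actually on the
// store with the number of partitions that were written.
val outputPath = "s3a://my-bucket/folder"
val expectedParts = df.rdd.getNumPartitions
val fs = new Path(outputPath).getFileSystem(spark.sparkContext.hadoopConfiguration)
val writtenParts = fs.listStatus(new Path(outputPath))
  .count(_.getPath.getName.startsWith("part-"))
require(writtenParts == expectedParts,
  s"Expected $expectedParts part files under $outputPath but found $writtenParts")
{code}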

  was:
https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233

Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, occasionally a file part will be 
missing; i.e. part 3 here:

```
> aws s3 ls my-bucket/folder/
2019-02-28 13:07:21  0 _SUCCESS
2019-02-28 13:06:58   79428651 
part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:06:59   79586172 
part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:00   79561910 
part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:01   79192617 
part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:07   79364413 
part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:08   79623254 
part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:10   79445030 
part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:10   79474923 
part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:11   79477310 
part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:12   79331453 
part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:13   79567600 
part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:13   79388012 
part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:14   79308387 
part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:15   79455483 
part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:17   79512342 
part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:18   79403307 
part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:18   79617769 
part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:19   79333534 
part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
2019-02-28 13:07:20   79543324 

[jira] [Comment Edited] (SPARK-27044) Maven dependency resolution does not support classifiers

2019-03-08 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787762#comment-16787762
 ] 

Jungtaek Lim edited comment on SPARK-27044 at 3/8/19 10:39 AM:
---

It's like "hive-exec" with default classifier vs "core" classifier. Former 
includes all transitive dependencies (without relocating) whereas latter just 
have itself and let Maven deal with transitive dependencies.

I've some experience regarding this: if no one minds I would like to work on 
this. [~gsomogyi] Are you OK with this? It's perfectly OK if you already took a 
step on this.


was (Author: kabhwan):
It's like "hive-exec" with default classifier vs "core" classifier. Former 
includes all transitive dependencies (without relocating) whereas latter just 
have itself and let Maven deal with transitive dependencies.

I've some experience regarding this: if no one minds I would like to work on 
this. [~gsomogyi] Are you OK with this? Or did you already take a step on this?

> Maven dependency resolution does not support classifiers
> 
>
> Key: SPARK-27044
> URL: https://issues.apache.org/jira/browse/SPARK-27044
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mathias Herberts
>Priority: Major
>
> The spark-submit --packages option allows dependencies to be specified using 
> the maven syntax group:artifact:version, but it does not support specifying a 
> classifier as group:artifact:version:classifier
> This makes it impossible to fetch 'uber' versions of packages for example as 
> they are usually specified using a classifier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`

2019-03-08 Thread Ivan Vergiliev (JIRA)
Ivan Vergiliev created SPARK-27105:
--

 Summary: Prevent exponential complexity in ORC `createFilter`  
 Key: SPARK-27105
 URL: https://issues.apache.org/jira/browse/SPARK-27105
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Ivan Vergiliev


`OrcFilters.createFilters` currently has complexity that's exponential in the 
height of the filter tree. There are multiple places in Spark that try to 
prevent the generation of skewed trees so as to not trigger this behaviour, for 
example:
- `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` combines 
a number of binary logical expressions into a balanced tree.
- https://github.com/apache/spark/pull/22313 introduced a change to 
`OrcFilters` to create a balanced tree instead of a skewed tree.

However, the underlying exponential behaviour can still be triggered by code 
paths that don't go through any of the tree balancing methods. For example, if 
one generates a tree of `Column`s directly in user code, there's nothing in 
Spark that automatically balances that tree and, hence, skewed trees hit the 
exponential behaviour. We have hit this in production with jobs mysteriously 
taking hours on the Spark driver with no worker activity, with as few as ~30 OR 
filters.

I have a fix locally that makes the underlying logic have linear complexity 
instead of exponential complexity. With this fix, the code can handle thousands 
of filters in milliseconds. I'll send a PR with the fix soon.
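
As a hedged illustration of the workaround available in user code today (not 
the linear-time fix described above), many predicates can be combined into a 
balanced tree rather than a left-skewed chain of `||`, which is the tree shape 
the description says avoids the exponential path:

{code}
import org.apache.spark.sql.Column

// Combine predicates with OR as a balanced tree instead of a skewed chain.
def balancedOr(preds: Seq[Column]): Column = {
  require(preds.nonEmpty, "need at least one predicate")
  if (preds.length == 1) preds.head
  else {
    val (left, right) = preds.splitAt(preds.length / 2)
    balancedOr(left) || balancedOr(right)
  }
}

// usage: df.filter(balancedOr(manyOrPredicates))
{code}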



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers

2019-03-08 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787762#comment-16787762
 ] 

Jungtaek Lim commented on SPARK-27044:
--

It's like "hive-exec" with default classifier vs "core" classifier. Former 
includes all transitive dependencies (without relocating) whereas latter just 
have itself and let Maven deal with transitive dependencies.

I've some experience regarding this: if no one minds I would like to work on 
this. [~gsomogyi] Are you OK with this? Or did you already take a step on this?

> Maven dependency resolution does not support classifiers
> 
>
> Key: SPARK-27044
> URL: https://issues.apache.org/jira/browse/SPARK-27044
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mathias Herberts
>Priority: Major
>
> The spark-submit --packages option allows dependencies to be specified using 
> the maven syntax group:artifact:version, but it does not support specifying a 
> classifier as group:artifact:version:classifier
> This makes it impossible to fetch 'uber' versions of packages for example as 
> they are usually specified using a classifier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27044) Maven dependency resolution does not support classifiers

2019-03-08 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787754#comment-16787754
 ] 

Gabor Somogyi commented on SPARK-27044:
---

[~herberts] Could you give an example of {quote}fetch 'uber' versions of 
packages{quote}?

> Maven dependency resolution does not support classifiers
> 
>
> Key: SPARK-27044
> URL: https://issues.apache.org/jira/browse/SPARK-27044
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Mathias Herberts
>Priority: Major
>
> The spark-submit --packages option allows dependencies to be specified using 
> the maven syntax group:artifact:version, but it does not support specifying a 
> classifier as group:artifact:version:classifier
> This makes it impossible to fetch 'uber' versions of packages for example as 
> they are usually specified using a classifier.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError

2019-03-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787749#comment-16787749
 ] 

Hyukjin Kwon commented on SPARK-27100:
--

Hm, can you show reproducible steps, including code? It should be checked 
whether the same issue exists in the current master branch or not.
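
One hedged note while a reproducer is prepared: with 100 iterations (as in the 
quoted command below), a deep iteration lineage is a plausible cause of the 
driver-side StackOverflowError, and checkpointing is the usual mitigation. This 
is an assumption for illustration, not something confirmed in the ticket.

{code}
import org.apache.spark.ml.recommendation.ALS

// Assumed mitigation sketch: keep ALS's lineage short by checkpointing.
// (Training data and the fit() call are omitted; the checkpoint path is an assumption.)
spark.sparkContext.setCheckpointDir("hdfs:///tmp/als-checkpoints")
val als = new ALS()
  .setRank(100)
  .setMaxIter(100)
  .setImplicitPrefs(true)
  .setCheckpointInterval(10)  // checkpoint every 10 iterations
{code}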

> dag-scheduler-event-loop" java.lang.StackOverflowError
> --
>
> Key: SPARK-27100
> URL: https://issues.apache.org/jira/browse/SPARK-27100
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.3, 2.3.3
>Reporter: KaiXu
>Priority: Major
> Attachments: stderr
>
>
> ALS in Spark MLlib causes StackOverflow:
>  /opt/sparkml/spark213/bin/spark-submit  --properties-file 
> /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class 
> com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client 
> --num-executors 3 --executor-memory 322g 
> /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar
>  --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 
> --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 
> --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input
>  
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27087) Inability to access to column alias in pyspark

2019-03-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27087.
--
Resolution: Invalid

> Inability to access to column alias in pyspark
> --
>
> Key: SPARK-27087
> URL: https://issues.apache.org/jira/browse/SPARK-27087
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Vincent
>Priority: Minor
>
> In pyspark I have the following:
> {code:java}
> import pyspark.sql.functions as F
> cc = F.lit(1).alias("A")
> print(cc)
> print(cc._jc.toString())
> {code}
> I get :
> {noformat}
> Column
> 1 AS `A`
> {noformat}
> Is there any way for me to just print "A" from cc? It seems I'm unable to 
> extract the alias programmatically from the column object.
> Also I think that in spark-sql in Scala, if I print "cc" it would just print 
> "A" instead, so this seems like a bug or a missing feature to me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27087) Inability to access to column alias in pyspark

2019-03-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787732#comment-16787732
 ] 

Hyukjin Kwon commented on SPARK-27087:
--

Scala side shows the same output as well:

{code}
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> lit(1).alias("A").toString
res2: String = 1 AS `A`
{code}

It shows the expression as-is, so it's correct to render the expression string 
as-is; this is not a bug or a missing feature.
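
A hedged Scala sketch of one way to get at the alias name, by inspecting the 
column's Catalyst expression; this relies on internal API and is an 
illustration rather than a supported approach.

{code}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.expressions.Alias

val cc = lit(1).alias("A")
val aliasName = cc.expr match {
  case a: Alias => a.name      // "A"
  case other    => other.sql   // fall back to the expression's SQL text
}
{code}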

> Inability to access to column alias in pyspark
> --
>
> Key: SPARK-27087
> URL: https://issues.apache.org/jira/browse/SPARK-27087
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Vincent
>Priority: Minor
>
> In pyspark I have the following:
> {code:java}
> import pyspark.sql.functions as F
> cc = F.lit(1).alias("A")
> print(cc)
> print(cc._jc.toString())
> {code}
> I get :
> {noformat}
> Column
> 1 AS `A`
> {noformat}
> Is there any way for me to just print "A" from cc? It seems I'm unable to 
> extract the alias programmatically from the column object.
> Also I think that in spark-sql in Scala, if I print "cc" it would just print 
> "A" instead, so this seems like a bug or a missing feature to me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27027:
-
Component/s: Spark Shell

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Hien Luu
>Priority: Minor
>
> The {{from_avro}} function produces wrong output for a struct field. See the 
> output at the bottom of the description.
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +-+
> |from_avro(value, struct)|
> +-+
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787722#comment-16787722
 ] 

Hyukjin Kwon commented on SPARK-27027:
--

[~hluu], just to confirm, did you face this issue in Spark shell?

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Hien Luu
>Priority: Minor
>
> The {{from_avro}} function produces wrong output for a struct field. See the 
> output at the bottom of the description.
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +-+
> |from_avro(value, struct)|
> +-+
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27027:
-
Affects Version/s: 3.0.0

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Hien Luu
>Priority: Minor
>
> The {{from_avro}} function produces wrong output for a struct field. See the 
> output at the bottom of the description.
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +-+
> |from_avro(value, struct)|
> +-+
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-27027:
--

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hien Luu
>Priority: Minor
>
> The {{from_avro}} function produces wrong output for a struct field. See the 
> output at the bottom of the description.
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +-+
> |from_avro(value, struct)|
> +-+
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787720#comment-16787720
 ] 

Gabor Somogyi commented on SPARK-27027:
---

After compiling:
{quote}./bin/spark-shell --packages 
org.apache.spark:spark-avro_2.12:3.0.0-SNAPSHOT{quote}
but as far as I remember, some changes were needed in the commands.


> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hien Luu
>Priority: Minor
>
> The {{from_avro}} function produces wrong output for a struct field. See the 
> output at the bottom of the description.
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +------------------------+
> |from_avro(value, struct)|
> +------------------------+
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> +------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables

2019-03-08 Thread Saurabh Santhosh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787716#comment-16787716
 ] 

Saurabh Santhosh commented on SPARK-23012:
--

[~yumwang] [~reks95]

Hi, I tested this in Spark 2.4.0 and it's working fine :)

> Support for predicate pushdown and partition pruning when left joining large 
> Hive tables
> 
>
> Key: SPARK-23012
> URL: https://issues.apache.org/jira/browse/SPARK-23012
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.2.0
>Reporter: Rick Kramer
>Priority: Major
>
> We have a hive view which left outer joins several large, partitioned orc 
> hive tables together on date. When the view is used in a hive query, hive 
> pushes date predicates down into the joins and prunes the partitions for all 
> tables. When I use this view from pyspark, the predicate is only used to 
> prune the left-most table and all partitions from the additional tables are 
> selected.
> For example, consider two partitioned hive tables a & b joined in a view:
> create table a (
>a_val string
> )
> partitioned by (ds string)
> stored as orc;
> create table b (
>b_val string
> )
> partitioned by (ds string)
> stored as orc;
> create view example_view as
> select
> a_val
> , b_val
> , ds
> from a 
> left outer join b on b.ds = a.ds
> Then in pyspark you might try to query from the view filtering on ds:
> spark.table('example_view').filter(F.col('ds') == '2018-01-01')
> If tables a and b are large, this results in a plan that filters a on ds = 
> 2018-01-01 but scans all partitions of table b.
> If the join in the view is changed to an inner join, the predicate gets 
> pushed down to a & b and the partitions are pruned as you'd expect.
> In practice, the view is fairly complex and contains a lot of business logic 
> we'd prefer not to replicate in pyspark if we can avoid it.
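
One way to check whether the ds predicate actually reaches both sides of the join is to look at the physical plan. A minimal sketch, assuming the partitioned tables and the example_view from the description already exist in the metastore:

{code}
// Hedged sketch: assumes the partitioned tables a, b and example_view from the
// description are already created in the Hive metastore.
import org.apache.spark.sql.functions.col

spark.table("example_view")
  .filter(col("ds") === "2018-01-01")
  .explain(true)

// In the printed physical plan, compare the scan nodes for a and b: on the
// affected versions only the scan of a carries the ds partition filter, while
// the scan of b reads all partitions.
{code}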



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787708#comment-16787708
 ] 

Hyukjin Kwon commented on SPARK-27027:
--

How did you run spark-shell?

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hien Luu
>Priority: Minor
>
> {{from_avro}} function produces wrong output of a struct field.  See the 
> output at the bottom of the description
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +------------------------+
> |from_avro(value, struct)|
> +------------------------+
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> +------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787701#comment-16787701
 ] 

Gabor Somogyi commented on SPARK-27027:
---

It reproduces with spark-shell; in a unit test the functionality works.

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hien Luu
>Priority: Minor
>
> {{from_avro}} function produces wrong output of a struct field.  See the 
> output at the bottom of the description
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +------------------------+
> |from_avro(value, struct)|
> +------------------------+
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> +------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787695#comment-16787695
 ] 

Hyukjin Kwon commented on SPARK-27027:
--

So, how did you reproduce it on the current master? Please set the affected version 
to 3.0.0 too and fix the JIRA description.

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hien Luu
>Priority: Minor
>
> {{from_avro}} function produces wrong output of a struct field.  See the 
> output at the bottom of the description
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +------------------------+
> |from_avro(value, struct)|
> +------------------------+
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> +------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27104) Spark Fair scheduler across applications in standalone mode

2019-03-08 Thread Hua Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Zhang updated SPARK-27104:
--
Description: 
Spark in standalone mode currently only supports a FIFO (first-in-first-out) 
scheduler across applications.

It would be great if a fair scheduler were supported. +A fair scheduler across 
applications, not within a single application.+

 

Use case (for example, with the Zeppelin integration):

At a certain moment, user A submits a heavy application and consumes all the 
resources of the Spark cluster.

At a later moment, user B submits a second application.

No matter how many worker nodes you add now, all the resources go to user A due 
to FIFO. User B will never get any resources until user A releases its 
allocated resources.

 

A fair scheduler should distribute extra resources in a fair way across all 
running applications that demand resources.

 

  was:
Spark in standalone mode currently only supports FIFO (first-in-first-out) 
scheduler across applications.

It will be great that a fair scheduler is supported. +A fair scheduler across 
applications, not in a application.+

 

Use case (for example with the integration of zeppelin)

At certain moment, user A submits an heavy application and consumes all the 
resources of the spark cluster.

At a later moment, user B submits a second application.

No matter how many work nodes you added now, all the resources go to user A due 
to the FIFO. User B will never get any resource until user A release its 
allocated resources.

 

A fair scheduler should distribute extra resources in a fair way on all running 
applications, which demands resources.

 


> Spark Fair scheduler across applications in standalone mode
> ---
>
> Key: SPARK-27104
> URL: https://issues.apache.org/jira/browse/SPARK-27104
> Project: Spark
>  Issue Type: Wish
>  Components: Scheduler
>Affects Versions: 2.2.3, 2.3.3, 2.4.0
>Reporter: Hua Zhang
>Priority: Minor
>
> Spark in standalone mode currently only supports a FIFO (first-in-first-out) 
> scheduler across applications.
> It would be great if a fair scheduler were supported. +A fair scheduler across 
> applications, not within a single application.+
>  
> Use case (for example, with the Zeppelin integration):
> At a certain moment, user A submits a heavy application and consumes all the 
> resources of the Spark cluster.
> At a later moment, user B submits a second application.
> No matter how many worker nodes you add now, all the resources go to user A 
> due to FIFO. User B will never get any resources until user A releases its 
> allocated resources.
>  
> A fair scheduler should distribute extra resources in a fair way across all 
> running applications that demand resources.
>  
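
Until such a scheduler exists, the usual mitigation in standalone mode is to cap how much of the cluster a single application may hold, so a heavy job cannot starve later submissions. A minimal sketch, with an illustrative master URL and illustrative numbers:

{code}
// Hedged sketch: not a fair scheduler, only the common standalone-mode workaround.
// Capping spark.cores.max leaves headroom for applications submitted later.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://master-host:7077")   // illustrative standalone master URL
  .appName("user-a-heavy-job")          // illustrative application name
  .config("spark.cores.max", "8")       // total cores this application may hold
  .config("spark.executor.cores", "2")  // cores per executor
  .getOrCreate()
{code}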



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27104) Spark Fair scheduler across applications in standalone mode

2019-03-08 Thread Hua Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Zhang updated SPARK-27104:
--
Description: 
Spark in standalone mode currently only supports FIFO (first-in-first-out) 
scheduler across applications.

It will be great that a fair scheduler is supported. +A fair scheduler across 
applications, not in a application.+

 

Use case (for example with the integration of zeppelin)

At certain moment, user A submits an heavy application and consumes all the 
resources of the spark cluster.

At a later moment, user B submits a second application.

No matter how many work nodes you added now, all the resources go to user A due 
to the FIFO. User B will never get any resource until user A release its 
allocated resources.

 

A fair scheduler should distribute extra resources in a fair way on all running 
applications, which demands resources.

 

  was:
Spark in standalone mode currently only supports FIFO (first-in-first-out) 
scheduler across applications.

It will be great that a fair scheduler is supported. +A fair scheduler across 
applications, not in a application.+

 

Use case (for example with the integration of zeppelin)

At certain moment, user A submits an heavy application en consumes all the 
resources of the spark cluster.

At a later moment, user B submits a second application.

No matter how many work nodes you added now, all the resources go to user A due 
to the FIFO. User B will never get any resource until user A release its 
allocated resources.

 

A fair scheduler should distribute extra resources in a fair way on all running 
applications, which demands resources.

 


> Spark Fair scheduler across applications in standalone mode
> ---
>
> Key: SPARK-27104
> URL: https://issues.apache.org/jira/browse/SPARK-27104
> Project: Spark
>  Issue Type: Wish
>  Components: Scheduler
>Affects Versions: 2.2.3, 2.3.3, 2.4.0
>Reporter: Hua Zhang
>Priority: Minor
>
> Spark in standalone mode currently only supports a FIFO (first-in-first-out) 
> scheduler across applications.
> It would be great if a fair scheduler were supported. +A fair scheduler across 
> applications, not within a single application.+
>  
> Use case (for example, with the Zeppelin integration):
> At a certain moment, user A submits a heavy application and consumes all the 
> resources of the Spark cluster.
> At a later moment, user B submits a second application.
> No matter how many worker nodes you add now, all the resources go to user A 
> due to FIFO. User B will never get any resources until user A releases its 
> allocated resources.
>  
> A fair scheduler should distribute extra resources in a fair way across all 
> running applications that demand resources.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27104) Spark Fair scheduler across applications in standalone mode

2019-03-08 Thread Hua Zhang (JIRA)
Hua Zhang created SPARK-27104:
-

 Summary: Spark Fair scheduler across applications in standalone 
mode
 Key: SPARK-27104
 URL: https://issues.apache.org/jira/browse/SPARK-27104
 Project: Spark
  Issue Type: Wish
  Components: Scheduler
Affects Versions: 2.4.0, 2.3.3, 2.2.3
Reporter: Hua Zhang


Spark in standalone mode currently only supports a FIFO (first-in-first-out) 
scheduler across applications.

It would be great if a fair scheduler were supported. +A fair scheduler across 
applications, not within a single application.+

 

Use case (for example, with the Zeppelin integration):

At a certain moment, user A submits a heavy application and consumes all the 
resources of the Spark cluster.

At a later moment, user B submits a second application.

No matter how many worker nodes you add now, all the resources go to user A due 
to FIFO. User B will never get any resources until user A releases its 
allocated resources.

 

A fair scheduler should distribute extra resources in a fair way across all running 
applications that demand resources.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787662#comment-16787662
 ] 

Gabor Somogyi commented on SPARK-27027:
---

Please see the comments. The issue is fully reproducible.

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hien Luu
>Priority: Minor
>
> {{from_avro}} function produces wrong output of a struct field.  See the 
> output at the bottom of the description
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +------------------------+
> |from_avro(value, struct)|
> +------------------------+
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> +------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-03-08 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787662#comment-16787662
 ] 

Gabor Somogyi edited comment on SPARK-27027 at 3/8/19 8:34 AM:
---

Please see the comments. The issue is fully reproducible even on master.


was (Author: gsomogyi):
Please see the comments. The issue is fully reproducible.

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Hien Luu
>Priority: Minor
>
> {{from_avro}} function produces wrong output of a struct field.  See the 
> output at the bottom of the description
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +------------------------+
> |from_avro(value, struct)|
> +------------------------+
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> |         [Josh Duke, 50]|
> +------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org