[jira] [Created] (SPARK-20761) Union uses column order rather than schema

2017-05-15 Thread Nakul Jeirath (JIRA)
Nakul Jeirath created SPARK-20761:
-

 Summary: Union uses column order rather than schema
 Key: SPARK-20761
 URL: https://issues.apache.org/jira/browse/SPARK-20761
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Nakul Jeirath
Priority: Minor


I believe there is an issue when using union to combine two DataFrames when the 
order of columns differs between the left and right side of the union:

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{BooleanType, StringType, StructField, 
StructType}

val schema = StructType(Seq(
  StructField("id", StringType, false),
  StructField("flag_one", BooleanType, false),
  StructField("flag_two", BooleanType, false),
  StructField("flag_three", BooleanType, false)
))

val rowRdd = spark.sparkContext.parallelize(Seq(
  Row("1", true, false, false),
  Row("2", false, true, false),
  Row("3", false, false, true)
))

spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")

val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

//Select columns out of order with respect to the emptyData schema
val data = emptyData.union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags"))
{code}

Selecting the data from the "temp_flags" table results in:

{noformat}
spark.sql("select * from temp_flags").show
+---+--------+--------+----------+
| id|flag_one|flag_two|flag_three|
+---+--------+--------+----------+
|  1|    true|   false|     false|
|  2|   false|    true|     false|
|  3|   false|   false|      true|
+---+--------+--------+----------+
{noformat}

This is the data we'd expect, but when inspecting "data" we get:

{noformat}
data.show()
+---+--------+--------+----------+
| id|flag_one|flag_two|flag_three|
+---+--------+--------+----------+
|  1|   false|   false|      true|
|  2|    true|   false|     false|
|  3|   false|    true|     false|
+---+--------+--------+----------+
{noformat}

Having a non-empty dataframe on the left side of the union doesn't seem to make 
a difference either:

{noformat}
spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, 
flag_three, flag_one from temp_flags")).show
+---+++--+
| id|flag_one|flag_two|flag_three|
+---+++--+
|  1|true|   false| false|
|  2|   false|true| false|
|  3|   false|   false|  true|
|  1|   false|   false|  true|
|  2|true|   false| false|
|  3|   false|true| false|
+---+++--+
{noformat}
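
A possible workaround sketch (editor's illustration, not part of the original report; it reuses the {{spark}} session and {{emptyData}} defined above, and the names {{reordered}} and {{fixed}} are made up for the example): since union in this version resolves columns by position, the right-hand side can be re-selected into the left-hand schema's column order before the union.

{code}
import org.apache.spark.sql.functions.col

// Reorder the right-hand side to the left-hand schema's column order,
// because union matches columns positionally rather than by name.
val reordered = spark.sql("select id, flag_two, flag_three, flag_one from temp_flags")
  .select(emptyData.columns.map(col): _*)

val fixed = emptyData.union(reordered)
fixed.show()  // the flag columns now line up with emptyData's schema
{code}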





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20502) ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final, sealed audit

2017-05-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20502.
---
   Resolution: Done
Fix Version/s: 2.2.0

> ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-20502
> URL: https://issues.apache.org/jira/browse/SPARK-20502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Blocker
> Fix For: 2.2.0
>
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Commented] (SPARK-20502) ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final, sealed audit

2017-05-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011768#comment-16011768
 ] 

Joseph K. Bradley commented on SPARK-20502:
---

Thanks for doing this audit!  Note that the list includes some items which are 
actually package private.  I largely agree with you about not making changes 
right now.  We could arguably make some more things non-Experimental, but 
really, I'd prefer to leave them as-is for this release.  Some of the main 
items:
* Summaries: I wonder if we should keep these Experimental b/c of future 
extensions, for which we'd want to make these final.
* Keep Evaluators Experimental b/c of ongoing discussions about supporting 
multiple metrics, etc.
* Keep RFormula Experimental b/c of existing differences from R which may 
require behavior changes to fix

I'll mark this as done, but others can comment if they disagree.  Thanks 
[~yuhaoyan]!

> ML, Graph 2.2 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-20502
> URL: https://issues.apache.org/jira/browse/SPARK-20502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Blocker
> Fix For: 2.2.0
>
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.






[jira] [Commented] (SPARK-20505) ML, Graph 2.2 QA: Update user guide for new features & APIs

2017-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011751#comment-16011751
 ] 

Apache Spark commented on SPARK-20505:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17994

> ML, Graph 2.2 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-20505
> URL: https://issues.apache.org/jira/browse/SPARK-20505
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Commented] (SPARK-20503) ML 2.2 QA: API: Python API coverage

2017-05-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011749#comment-16011749
 ] 

Joseph K. Bradley commented on SPARK-20503:
---

[~mlnick] Did you do this audit?  If not, let me know, and I can.

> ML 2.2 QA: API: Python API coverage
> ---
>
> Key: SPARK-20503
> URL: https://issues.apache.org/jira/browse/SPARK-20503
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*






[jira] [Commented] (SPARK-20501) ML, Graph 2.2 QA: API: New Scala APIs, docs

2017-05-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011727#comment-16011727
 ] 

Joseph K. Bradley commented on SPARK-20501:
---

I checked those classes' docs as well, so I went ahead and closed this.  Thanks!

> ML, Graph 2.2 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-20501
> URL: https://issues.apache.org/jira/browse/SPARK-20501
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Resolved] (SPARK-20501) ML, Graph 2.2 QA: API: New Scala APIs, docs

2017-05-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20501.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17934
[https://github.com/apache/spark/pull/17934]

> ML, Graph 2.2 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-20501
> URL: https://issues.apache.org/jira/browse/SPARK-20501
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.






[jira] [Updated] (SPARK-20760) Memory Leak of RDD blocks

2017-05-15 Thread Binzi Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binzi Cao updated SPARK-20760:
--
Description: 
Memory leak of RDD blocks in a long-running RDD process.

We have a long-running application that does computations over RDDs, and we found that the number of RDD blocks keeps increasing on the Spark UI page. The RDD blocks and memory usage do not match the cached RDDs and their memory. It looks like Spark keeps old RDDs in memory and never releases them, or never gets a chance to release them. The job eventually dies of out-of-memory errors.

In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same issue in YARN cluster mode in both Kafka streaming and batch applications. The issue in streaming is similar; however, the RDD blocks seem to grow a bit more slowly than in batch jobs.

The sample code below reproduces the issue by just running it in local mode.
Scala file:
{code}
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global

case class Person(id: String, name: String)

object RDDApp {
  def run(sc: SparkContext) = {
    while (true) {
      // Build a small random dataset, cache it, query it from 100 concurrent
      // futures, then unpersist it; the cycle repeats forever.
      val r = scala.util.Random
      val data = (1 to r.nextInt(100)).toList.map { a =>
        Person(a.toString, a.toString)
      }
      val rdd = sc.parallelize(data)
      rdd.cache
      println("running")
      val a = (1 to 100).toList.map { x =>
        Future(rdd.filter(_.id == x.toString).collect)
      }
      a.foreach { f =>
        println(Await.ready(f, Duration.Inf).value.get)
      }
      rdd.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("test")
    val sc = new SparkContext(conf)
    run(sc)
  }
}
{code}
build sbt file:
{code}
name := "RDDTest"
version := "0.1.1"


scalaVersion := "2.11.5"

libraryDependencies ++= Seq (
"org.scalaz" %% "scalaz-core" % "7.2.0",
"org.scalaz" %% "scalaz-concurrent" % "7.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
  )

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
mainClass in assembly := Some("RDDApp")
test in assembly := {}
{code}

To reproduce it, just run:
{code}

spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
--executor-memory 4G \
--executor-cores 1 \
--num-executors 1 \
--class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
{code}
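
As a diagnostic aid (an editor's sketch building on the reproduction above, not part of the original report; {{logCachedRdds}} is an illustrative name and {{sc}} is the SparkContext from the code above), the driver-side view of persisted RDDs can be logged each iteration and compared against the RDD block count shown in the UI:

{code}
import org.apache.spark.SparkContext

// Log what the driver still considers cached, so the growth of RDD blocks in
// the UI can be compared against the RDDs the application believes are persisted.
def logCachedRdds(sc: SparkContext): Unit = {
  println(s"persistent RDDs tracked by the driver: ${sc.getPersistentRDDs.size}")
  sc.getRDDStorageInfo.foreach { info =>
    println(s"rdd ${info.id} '${info.name}': ${info.numCachedPartitions} cached partitions, " +
      s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
  }
}

// e.g. call logCachedRdds(sc) at the end of each iteration of the while loop above.
{code}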


  was:
Memory leak for RDD blocks for a long time running rdd process.

I have a long term running application, which is doing computations of RDDs. 
and I found the RDD blocks are keep increasing in the spark ui page. The rdd 
blocks and memory usage does not mach the cached rdds and memory. It looks like 
spark keeps old rdd in memory and never released it or never got a chance to 
release it. The job will eventually die of out of memory. 

In addition, I'm not seeing this issue in spark 1.6. 

The below is the minimized code and it is reproducible by justing running it in 
local mode. 
Scala file:
{code}
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global
case class Person(id: String, name: String)
object RDDApp {
  def run(sc: SparkContext) = {
while (true) {
  val r = scala.util.Random
  val data = (1 to r.nextInt(100)).toList.map { a =>
Person(a.toString, a.toString)
  }
  val rdd = sc.parallelize(data)
  rdd.cache
  println("running")
  val a = (1 to 100).toList.map { x =>
Future(rdd.filter(_.id == x.toString).collect)
  }
  a.foreach { f =>
println(Await.ready(f, Duration.Inf).value.get)
  }
  rdd.unpersist()
}

  }
  def main(args: Array[String]): Unit = {
   val conf = new SparkConf().setAppName("test")
val sc   = new SparkContext(conf)
run(sc)

  }
}

{code}
build sbt file:
{code}
name := "RDDTest"
version := "0.1.1"


scalaVersion := "2.11.5"

libraryDependencies ++= Seq (
"org.scalaz" %% "scalaz-core" % "7.2.0",
"org.scalaz" %% "scalaz-concurrent" % "7.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
  )

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
mainClass in assembly := Some("RDDApp")
test in assembly := {}
{code}

To reproduce it: 

Just 
{code}

spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
--executor-memory 4G \
--executor-cores 1 \
--num-executors 1 \
--class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar

[jira] [Issue Comment Deleted] (SPARK-20760) Memory Leak of RDD blocks

2017-05-15 Thread Binzi Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binzi Cao updated SPARK-20760:
--
Comment: was deleted

(was: RDD blocks are growing crazily after running for a couple of hours)

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
>Priority: Critical
> Attachments: RDD Blocks .png
>
>
> Memory leak for RDD blocks for a long time running rdd process.
> I have a long term running application, which is doing computations of RDDs. 
> and I found the RDD blocks are keep increasing in the spark ui page. The rdd 
> blocks and memory usage does not mach the cached rdds and memory. It looks 
> like spark keeps old rdd in memory and never released it or never got a 
> chance to release it. The job will eventually die of out of memory. 
> In addition, I'm not seeing this issue in spark 1.6. 
> The below is the minimized code and it is reproducible by justing running it 
> in local mode. 
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
>val conf = new SparkConf().setAppName("test")
> val sc   = new SparkContext(conf)
> run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}






[jira] [Updated] (SPARK-20760) Memory Leak of RDD blocks

2017-05-15 Thread Binzi Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binzi Cao updated SPARK-20760:
--
Attachment: RDD Blocks .png

RDD blocks are increasing rapidly after running the app for a couple of hours; see the attached screenshot of the Spark UI page.

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
>Priority: Critical
> Attachments: RDD Blocks .png
>
>
> Memory leak for RDD blocks for a long time running rdd process.
> I have a long term running application, which is doing computations of RDDs. 
> and I found the RDD blocks are keep increasing in the spark ui page. The rdd 
> blocks and memory usage does not mach the cached rdds and memory. It looks 
> like spark keeps old rdd in memory and never released it or never got a 
> chance to release it. The job will eventually die of out of memory. 
> In addition, I'm not seeing this issue in spark 1.6. 
> The below is the minimized code and it is reproducible by justing running it 
> in local mode. 
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
>val conf = new SparkConf().setAppName("test")
> val sc   = new SparkContext(conf)
> run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}






[jira] [Updated] (SPARK-20760) Memory Leak of RDD blocks

2017-05-15 Thread Binzi Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binzi Cao updated SPARK-20760:
--
Attachment: (was: Screen Shot 2017-05-16 at 1.47.06 pm.png)

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
>Priority: Critical
>
> Memory leak for RDD blocks for a long time running rdd process.
> I have a long term running application, which is doing computations of RDDs. 
> and I found the RDD blocks are keep increasing in the spark ui page. The rdd 
> blocks and memory usage does not mach the cached rdds and memory. It looks 
> like spark keeps old rdd in memory and never released it or never got a 
> chance to release it. The job will eventually die of out of memory. 
> In addition, I'm not seeing this issue in spark 1.6. 
> The below is the minimized code and it is reproducible by justing running it 
> in local mode. 
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
>val conf = new SparkConf().setAppName("test")
> val sc   = new SparkContext(conf)
> run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}






[jira] [Updated] (SPARK-20760) Memory Leak of RDD blocks

2017-05-15 Thread Binzi Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binzi Cao updated SPARK-20760:
--
Attachment: Screen Shot 2017-05-16 at 1.47.06 pm.png

RDD blocks are growing crazily after running for a couple of hours

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
>Priority: Critical
> Attachments: Screen Shot 2017-05-16 at 1.47.06 pm.png
>
>
> Memory leak for RDD blocks for a long time running rdd process.
> I have a long term running application, which is doing computations of RDDs. 
> and I found the RDD blocks are keep increasing in the spark ui page. The rdd 
> blocks and memory usage does not mach the cached rdds and memory. It looks 
> like spark keeps old rdd in memory and never released it or never got a 
> chance to release it. The job will eventually die of out of memory. 
> In addition, I'm not seeing this issue in spark 1.6. 
> The below is the minimized code and it is reproducible by justing running it 
> in local mode. 
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
>val conf = new SparkConf().setAppName("test")
> val sc   = new SparkContext(conf)
> run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}






[jira] [Updated] (SPARK-20760) Memory Leak of RDD blocks

2017-05-15 Thread Binzi Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binzi Cao updated SPARK-20760:
--
Description: 
Memory leak of RDD blocks in a long-running RDD process.

I have a long-running application that does computations over RDDs, and I found that the number of RDD blocks keeps increasing on the Spark UI page. The RDD blocks and memory usage do not match the cached RDDs and their memory. It looks like Spark keeps old RDDs in memory and never releases them, or never gets a chance to release them. The job eventually dies of out-of-memory errors.

In addition, I'm not seeing this issue in Spark 1.6.

The minimized code below reproduces the issue by just running it in local mode.
Scala file:
{code}
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global
case class Person(id: String, name: String)
object RDDApp {
  def run(sc: SparkContext) = {
while (true) {
  val r = scala.util.Random
  val data = (1 to r.nextInt(100)).toList.map { a =>
Person(a.toString, a.toString)
  }
  val rdd = sc.parallelize(data)
  rdd.cache
  println("running")
  val a = (1 to 100).toList.map { x =>
Future(rdd.filter(_.id == x.toString).collect)
  }
  a.foreach { f =>
println(Await.ready(f, Duration.Inf).value.get)
  }
  rdd.unpersist()
}

  }
  def main(args: Array[String]): Unit = {
   val conf = new SparkConf().setAppName("test")
val sc   = new SparkContext(conf)
run(sc)

  }
}

{code}
build sbt file:
{code}
name := "RDDTest"
version := "0.1.1"


scalaVersion := "2.11.5"

libraryDependencies ++= Seq (
"org.scalaz" %% "scalaz-core" % "7.2.0",
"org.scalaz" %% "scalaz-concurrent" % "7.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
  )

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
mainClass in assembly := Some("RDDApp")
test in assembly := {}
{code}

To reproduce it: 

Just 
{code}

spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
--executor-memory 4G \
--executor-cores 1 \
--num-executors 1 \
--class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
{code}


  was:
Memory leak for RDD blocks for a long time running rdd process.

I have a long term running application, which is doing computations of RDDs. 
and I found the RDD blocks are keep increasing in the spark ui page. The rdd 
blocks and memory usage does not mach the cached rdds and memory. It looks like 
spark keeps old rdd in memory and never released it or never got a chance to 
release it. The job will eventually die of out of memory. 

In addition, I'm not seeing this issue in spark 1.6. 

The below is the minimized code and it is reproducible by justing running it in 
local mode. 
Scala file:
{code}
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global
case class Person(id: String, name: String)
object RDDApp {
  def run(sc: SparkContext) = {
while (true) {
  val r = scala.util.Random
  val data = (1 to r.nextInt(100)).toList.map { a =>
Person(a.toString, a.toString)
  }
  val rdd = sc.parallelize(data)
  rdd.cache
  println("running")
  val a = (1 to 100).toList.map { x =>
Future(rdd.filter(_.id == x.toString).collect)
  }
  a.foreach { f =>
println(Await.ready(f, Duration.Inf).value.get)
  }
  rdd.unpersist()
}

  }
  def main(args: Array[String]): Unit = {
   val conf = new SparkConf().setAppName("test")
val sc   = new SparkContext(conf)
run(sc)

  }
}

{code}
build sbt file:

name := "RDDTest"
version := "0.1.1"


scalaVersion := "2.11.5"

libraryDependencies ++= Seq (
"org.scalaz" %% "scalaz-core" % "7.2.0",
"org.scalaz" %% "scalaz-concurrent" % "7.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
  )

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
mainClass in assembly := Some("RDDApp")
test in assembly := {}

To reproduce it: 

Just 

spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
--executor-memory 4G \
--executor-cores 1 \
--num-executors 1 \
--class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar



> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug

[jira] [Updated] (SPARK-20760) Memory Leak of RDD blocks

2017-05-15 Thread Binzi Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binzi Cao updated SPARK-20760:
--
Description: 
Memory leak of RDD blocks in a long-running RDD process.

I have a long-running application that does computations over RDDs, and I found that the number of RDD blocks keeps increasing on the Spark UI page. The RDD blocks and memory usage do not match the cached RDDs and their memory. It looks like Spark keeps old RDDs in memory and never releases them, or never gets a chance to release them. The job eventually dies of out-of-memory errors.

In addition, I'm not seeing this issue in Spark 1.6.

The minimized code below reproduces the issue by just running it in local mode.
Scala file:
{code}
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global
case class Person(id: String, name: String)
object RDDApp {
  def run(sc: SparkContext) = {
while (true) {
  val r = scala.util.Random
  val data = (1 to r.nextInt(100)).toList.map { a =>
Person(a.toString, a.toString)
  }
  val rdd = sc.parallelize(data)
  rdd.cache
  println("running")
  val a = (1 to 100).toList.map { x =>
Future(rdd.filter(_.id == x.toString).collect)
  }
  a.foreach { f =>
println(Await.ready(f, Duration.Inf).value.get)
  }
  rdd.unpersist()
}

  }
  def main(args: Array[String]): Unit = {
   val conf = new SparkConf().setAppName("test")
val sc   = new SparkContext(conf)
run(sc)

  }
}

{code}
build sbt file:

name := "RDDTest"
version := "0.1.1"


scalaVersion := "2.11.5"

libraryDependencies ++= Seq (
"org.scalaz" %% "scalaz-core" % "7.2.0",
"org.scalaz" %% "scalaz-concurrent" % "7.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
  )

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
mainClass in assembly := Some("RDDApp")
test in assembly := {}

To reproduce it: 

Just 

spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
--executor-memory 4G \
--executor-cores 1 \
--num-executors 1 \
--class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar


  was:
Memory leak for RDD blocks for a long time running rdd process.

I have a long term running application, which is doing computations of RDDs. 
and I found the RDD blocks are keep increasing in the spark ui page. The rdd 
blocks and memory usage does not mach the cached rdds and memory. It looks like 
spark keeps old rdd in memory and never released it or never got a chance to 
release it. The job will eventually die of out of memory. 

In addition, I'm not seeing this issue in spark 1.6. 

The below is the minimized code and it is reproducible by justing running it in 
local mode. 
Scala file:
{{{
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global
case class Person(id: String, name: String)
object RDDApp {
  def run(sc: SparkContext) = {
while (true) {
  val r = scala.util.Random
  val data = (1 to r.nextInt(100)).toList.map { a =>
Person(a.toString, a.toString)
  }
  val rdd = sc.parallelize(data)
  rdd.cache
  println("running")
  val a = (1 to 100).toList.map { x =>
Future(rdd.filter(_.id == x.toString).collect)
  }
  a.foreach { f =>
println(Await.ready(f, Duration.Inf).value.get)
  }
  rdd.unpersist()
}

  }
  def main(args: Array[String]): Unit = {
   val conf = new SparkConf().setAppName("test")
val sc   = new SparkContext(conf)
run(sc)

  }
}

}}}
build sbt file:

name := "RDDTest"
version := "0.1.1"


scalaVersion := "2.11.5"

libraryDependencies ++= Seq (
"org.scalaz" %% "scalaz-core" % "7.2.0",
"org.scalaz" %% "scalaz-concurrent" % "7.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
  )

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
mainClass in assembly := Some("RDDApp")
test in assembly := {}

To reproduce it: 

Just 

spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
--executor-memory 4G \
--executor-cores 1 \
--num-executors 1 \
--class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar



> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block 

[jira] [Updated] (SPARK-20760) Memory Leak of RDD blocks

2017-05-15 Thread Binzi Cao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Binzi Cao updated SPARK-20760:
--
Description: 
Memory leak of RDD blocks in a long-running RDD process.

I have a long-running application that does computations over RDDs, and I found that the number of RDD blocks keeps increasing on the Spark UI page. The RDD blocks and memory usage do not match the cached RDDs and their memory. It looks like Spark keeps old RDDs in memory and never releases them, or never gets a chance to release them. The job eventually dies of out-of-memory errors.

In addition, I'm not seeing this issue in Spark 1.6.

The minimized code below reproduces the issue by just running it in local mode.
Scala file:
{{{
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global
case class Person(id: String, name: String)
object RDDApp {
  def run(sc: SparkContext) = {
while (true) {
  val r = scala.util.Random
  val data = (1 to r.nextInt(100)).toList.map { a =>
Person(a.toString, a.toString)
  }
  val rdd = sc.parallelize(data)
  rdd.cache
  println("running")
  val a = (1 to 100).toList.map { x =>
Future(rdd.filter(_.id == x.toString).collect)
  }
  a.foreach { f =>
println(Await.ready(f, Duration.Inf).value.get)
  }
  rdd.unpersist()
}

  }
  def main(args: Array[String]): Unit = {
   val conf = new SparkConf().setAppName("test")
val sc   = new SparkContext(conf)
run(sc)

  }
}

}}}
build sbt file:

name := "RDDTest"
version := "0.1.1"


scalaVersion := "2.11.5"

libraryDependencies ++= Seq (
"org.scalaz" %% "scalaz-core" % "7.2.0",
"org.scalaz" %% "scalaz-concurrent" % "7.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
  )

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
mainClass in assembly := Some("RDDApp")
test in assembly := {}

To reproduce it: 

Just 

spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
--executor-memory 4G \
--executor-cores 1 \
--num-executors 1 \
--class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar


  was:
Memory lead for RDD blocks for a long time running rdd process. I have a long 
term running application, which is doing caculations of RDDs. and I found the 
RDD blocks are keep increasing. The rdd blocks and memory usage does not mach 
the cached rdds and memory. It looks like spark keeps old rdds in memory and 
never released it. In addtion, I'm not seeing this issue in spark 1.6. 

The below is the minimized code and it is reproducible by justing running it in 
local mode. 
Scala file:
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global
case class Person(id: String, name: String)
object RDDApp {
  def run(sc: SparkContext) = {
while (true) {
  val r = scala.util.Random
  val data = (1 to r.nextInt(100)).toList.map { a =>
Person(a.toString, a.toString)
  }
  val rdd = sc.parallelize(data)
  rdd.cache
  println("running")
  val a = (1 to 100).toList.map { x =>
Future(rdd.filter(_.id == x.toString).collect)
  }
  a.foreach { f =>
println(Await.ready(f, Duration.Inf).value.get)
  }
  rdd.unpersist()
}

  }
  def main(args: Array[String]): Unit = {
   val conf = new SparkConf().setAppName("test")
val sc   = new SparkContext(conf)
run(sc)

  }
}
build sbt file:

name := "RDDTest"
version := "0.1.1"


scalaVersion := "2.11.5"

libraryDependencies ++= Seq (
"org.scalaz" %% "scalaz-core" % "7.2.0",
"org.scalaz" %% "scalaz-concurrent" % "7.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
  )

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
mainClass in assembly := Some("RDDApp")
test in assembly := {}

To reproduce it: 

Just 

spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
--executor-memory 4G \
--executor-cores 1 \
--num-executors 1 \
--class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar



> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
>

[jira] [Created] (SPARK-20760) Memory Leak of RDD blocks

2017-05-15 Thread Binzi Cao (JIRA)
Binzi Cao created SPARK-20760:
-

 Summary: Memory Leak of RDD blocks 
 Key: SPARK-20760
 URL: https://issues.apache.org/jira/browse/SPARK-20760
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 2.1.0
 Environment: Spark 2.1.0
Reporter: Binzi Cao
Priority: Critical


Memory leak of RDD blocks in a long-running RDD process. I have a long-running application that does calculations over RDDs, and I found that the RDD blocks keep increasing. The RDD blocks and memory usage do not match the cached RDDs and memory. It looks like Spark keeps old RDDs in memory and never releases them. In addition, I'm not seeing this issue in Spark 1.6.

The minimized code below reproduces the issue by just running it in local mode.
Scala file:
import scala.concurrent.duration.Duration
import scala.util.{Try, Failure, Success}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.concurrent._
import ExecutionContext.Implicits.global
case class Person(id: String, name: String)
object RDDApp {
  def run(sc: SparkContext) = {
while (true) {
  val r = scala.util.Random
  val data = (1 to r.nextInt(100)).toList.map { a =>
Person(a.toString, a.toString)
  }
  val rdd = sc.parallelize(data)
  rdd.cache
  println("running")
  val a = (1 to 100).toList.map { x =>
Future(rdd.filter(_.id == x.toString).collect)
  }
  a.foreach { f =>
println(Await.ready(f, Duration.Inf).value.get)
  }
  rdd.unpersist()
}

  }
  def main(args: Array[String]): Unit = {
   val conf = new SparkConf().setAppName("test")
val sc   = new SparkContext(conf)
run(sc)

  }
}
build sbt file:

name := "RDDTest"
version := "0.1.1"


scalaVersion := "2.11.5"

libraryDependencies ++= Seq (
"org.scalaz" %% "scalaz-core" % "7.2.0",
"org.scalaz" %% "scalaz-concurrent" % "7.2.0",
"org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
"org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
  )

addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
mainClass in assembly := Some("RDDApp")
test in assembly := {}

To reproduce it: 

Just 

spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
--executor-memory 4G \
--executor-cores 1 \
--num-executors 1 \
--class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar







[jira] [Assigned] (SPARK-20758) Add Constant propagation optimization

2017-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20758:


Assignee: (was: Apache Spark)

> Add Constant propagation optimization
> -
>
> Key: SPARK-20758
> URL: https://issues.apache.org/jira/browse/SPARK-20758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Tejas Patil
>Priority: Minor
>
> Constant propagation involves substituting attributes which can be statically 
> evaluated in expressions. Its a pretty common optimization in compilers world.
> eg.
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = i + 3
> {noformat}
> can be re-written as:
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = 8
> {noformat}






[jira] [Assigned] (SPARK-20759) SCALA_VERSION in _config.yml should be consistent with pom.xml

2017-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20759:


Assignee: Apache Spark

> SCALA_VERSION in _config.yml should be consistent with pom.xml
> --
>
> Key: SPARK-20759
> URL: https://issues.apache.org/jira/browse/SPARK-20759
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Assignee: Apache Spark
>
> SCALA_VERSION in _config.yml is 2.11.7, but 2.11.8 in pom.xml. So I think 
> SCALA_VERSION in _config.yml should be consistent with pom.xml.






[jira] [Assigned] (SPARK-20759) SCALA_VERSION in _config.yml should be consistent with pom.xml

2017-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20759:


Assignee: (was: Apache Spark)

> SCALA_VERSION in _config.yml should be consistent with pom.xml
> --
>
> Key: SPARK-20759
> URL: https://issues.apache.org/jira/browse/SPARK-20759
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>
> SCALA_VERSION in _config.yml is 2.11.7, but 2.11.8 in pom.xml. So I think 
> SCALA_VERSION in _config.yml should be consistent with pom.xml.






[jira] [Assigned] (SPARK-20758) Add Constant propagation optimization

2017-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20758:


Assignee: Apache Spark

> Add Constant propagation optimization
> -
>
> Key: SPARK-20758
> URL: https://issues.apache.org/jira/browse/SPARK-20758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Minor
>
> Constant propagation involves substituting attributes which can be statically 
> evaluated in expressions. Its a pretty common optimization in compilers world.
> eg.
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = i + 3
> {noformat}
> can be re-written as:
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = 8
> {noformat}






[jira] [Commented] (SPARK-20759) SCALA_VERSION in _config.yml should be consistent with pom.xml

2017-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011675#comment-16011675
 ] 

Apache Spark commented on SPARK-20759:
--

User 'liu-zhaokun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17992

> SCALA_VERSION in _config.yml should be consistent with pom.xml
> --
>
> Key: SPARK-20759
> URL: https://issues.apache.org/jira/browse/SPARK-20759
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>
> SCALA_VERSION in _config.yml is 2.11.7, but 2.11.8 in pom.xml. So I think 
> SCALA_VERSION in _config.yml should be consistent with pom.xml.






[jira] [Commented] (SPARK-20758) Add Constant propagation optimization

2017-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011674#comment-16011674
 ] 

Apache Spark commented on SPARK-20758:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17993

> Add Constant propagation optimization
> -
>
> Key: SPARK-20758
> URL: https://issues.apache.org/jira/browse/SPARK-20758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Tejas Patil
>Priority: Minor
>
> Constant propagation involves substituting attributes which can be statically 
> evaluated in expressions. Its a pretty common optimization in compilers world.
> eg.
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = i + 3
> {noformat}
> can be re-written as:
> {noformat}
> SELECT * FROM table WHERE i = 5 AND j = 8
> {noformat}






[jira] [Created] (SPARK-20759) SCALA_VERSION in _config.yml should be consistent with pom.xml

2017-05-15 Thread liuzhaokun (JIRA)
liuzhaokun created SPARK-20759:
--

 Summary: SCALA_VERSION in _config.yml should be consistent with 
pom.xml
 Key: SPARK-20759
 URL: https://issues.apache.org/jira/browse/SPARK-20759
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.1.1
Reporter: liuzhaokun


SCALA_VERSION in _config.yml is 2.11.7, but it is 2.11.8 in pom.xml, so SCALA_VERSION in _config.yml should be made consistent with pom.xml.






[jira] [Created] (SPARK-20758) Add Constant propagation optimization

2017-05-15 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-20758:
---

 Summary: Add Constant propagation optimization
 Key: SPARK-20758
 URL: https://issues.apache.org/jira/browse/SPARK-20758
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Tejas Patil
Priority: Minor


Constant propagation involves substituting attributes that can be statically 
evaluated in expressions. It's a pretty common optimization in the compiler world.

eg.
{noformat}
SELECT * FROM table WHERE i = 5 AND j = i + 3
{noformat}

can be re-written as:
{noformat}
SELECT * FROM table WHERE i = 5 AND j = 8
{noformat}
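
To make the rewrite concrete, here is a minimal, self-contained sketch of constant propagation over a toy predicate AST (an editor's illustration only; it does not use Spark's Catalyst classes, and all names below are made up for the example):

{code}
sealed trait Expr
case class Attr(name: String)               extends Expr  // column reference, e.g. i
case class Lit(value: Int)                  extends Expr  // integer literal
case class Add(left: Expr, right: Expr)     extends Expr  // arithmetic
case class EqualTo(left: Expr, right: Expr) extends Expr  // comparison
case class And(left: Expr, right: Expr)     extends Expr  // conjunction

object ConstantPropagationSketch {
  // Collect "attribute = literal" equalities from a conjunction.
  def bindings(e: Expr): Map[String, Int] = e match {
    case EqualTo(Attr(n), Lit(v)) => Map(n -> v)
    case And(l, r)                => bindings(l) ++ bindings(r)
    case _                        => Map.empty
  }

  // Substitute known constants and fold arithmetic where both sides become literals.
  def propagate(e: Expr, env: Map[String, Int]): Expr = e match {
    case Attr(n) if env.contains(n) => Lit(env(n))
    case Add(l, r) =>
      (propagate(l, env), propagate(r, env)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)
        case (pl, pr)         => Add(pl, pr)
      }
    case EqualTo(Attr(n), r) => EqualTo(Attr(n), propagate(r, env))
    case And(l, r)           => And(propagate(l, env), propagate(r, env))
    case other               => other
  }

  def main(args: Array[String]): Unit = {
    // WHERE i = 5 AND j = i + 3
    val pred = And(EqualTo(Attr("i"), Lit(5)),
                   EqualTo(Attr("j"), Add(Attr("i"), Lit(3))))
    // prints the equivalent of: i = 5 AND j = 8
    println(propagate(pred, bindings(pred)))
  }
}
{code}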






[jira] [Assigned] (SPARK-20757) Spark timeout several small optimization

2017-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20757:


Assignee: (was: Apache Spark)

> Spark timeout several small optimization
> 
>
> Key: SPARK-20757
> URL: https://issues.apache.org/jira/browse/SPARK-20757
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1, 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>







[jira] [Assigned] (SPARK-20757) Spark timeout several small optimization

2017-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20757:


Assignee: Apache Spark

> Spark timeout several small optimization
> 
>
> Key: SPARK-20757
> URL: https://issues.apache.org/jira/browse/SPARK-20757
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1, 2.3.0
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-20757) Spark timeout several small optimization

2017-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011670#comment-16011670
 ] 

Apache Spark commented on SPARK-20757:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/17991

> Spark timeout several small optimization
> 
>
> Key: SPARK-20757
> URL: https://issues.apache.org/jira/browse/SPARK-20757
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1, 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>







[jira] [Issue Comment Deleted] (SPARK-20757) Spark timeout several small optimization

2017-05-15 Thread liuzhaokun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuzhaokun updated SPARK-20757:
---
Comment: was deleted

(was: I think it's very meaningful.Look forward to community merge it.)

> Spark timeout several small optimization
> 
>
> Key: SPARK-20757
> URL: https://issues.apache.org/jira/browse/SPARK-20757
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1, 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20757) Spark timeout several small optimization

2017-05-15 Thread liuzhaokun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011665#comment-16011665
 ] 

liuzhaokun commented on SPARK-20757:


I think it's very meaningful. I look forward to the community merging it.

> Spark timeout several small optimization
> 
>
> Key: SPARK-20757
> URL: https://issues.apache.org/jira/browse/SPARK-20757
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1, 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-15 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-20504:
---
Comment: was deleted

(was: You’re right, this is really a headache. Java tools cannot extract some of 
the information `scalac` generates in the jars, such as the package-private 
modifier, the private-class modifier, and so on.
Fetching and comparing these changes through scala-doc would be painful to do 
programmatically, so I hope to avoid that route.
But this information is still preserved in the jar, so I think we can get it 
through the `scala reflection api`:
http://docs.scala-lang.org/overviews/reflection/overview.html
but I need to study it first.

Thanks!

Sent from Windows Mail

From: Joseph K. Bradley (JIRA)
Sent: Tuesday, May 16, 2017 7:24 AM
To: weichenxu...@outlook.com


[ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011293#comment-16011293
 ]

Joseph K. Bradley edited comment on SPARK-20504 at 5/15/17 11:23 PM:
-

I found one issue: This won't pick up on methods which were package private in 
2.1 and were made public in 2.2.  E.g.: Matrix.foreachActive was made public in 
[SPARK-17471] here: 
https://github.com/apache/spark/pull/15628/files#diff-440e1b707197e577b932a055ab16293eR158
But it does not show up in the diff.  (Update: This foreachActive API will be 
OK to leave as is.)

We won't be able to identify these cases from the JARs; we'll have to rely on 
the docs.



was (Author: josephkb):
I found one issue: This won't pick up on methods which were package private in 
2.1 and were made public in 2.2.  E.g.: Matrix.foreachActive was made public in 
[SPARK-17471] here: 
https://github.com/apache/spark/pull/15628/files#diff-440e1b707197e577b932a055ab16293eR158
But it does not show up in the diff.  (I'll ping on that PR to see what people 
want to do.)

We won't be able to identify these cases from the JARs; we'll have to rely on 
the docs.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
)

> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.2.0
>
> Attachments: 1_process_script.sh, 2_signature.diff, 
> 3_added_class_signature, 4_common_ml_class, 5_added_ml_class
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so that we can make 
> this task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To 

[jira] [Created] (SPARK-20757) Spark timeout several small optimization

2017-05-15 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-20757:
--

 Summary: Spark timeout several small optimization
 Key: SPARK-20757
 URL: https://issues.apache.org/jira/browse/SPARK-20757
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Affects Versions: 2.1.1, 2.3.0
Reporter: guoxiaolongzte
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20707) ML deprecated APIs should be removed in major release.

2017-05-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-20707.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> ML deprecated APIs should be removed in major release.
> --
>
> Key: SPARK-20707
> URL: https://issues.apache.org/jira/browse/SPARK-20707
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.2.0
>
>
> Before 2.2, MLlib always removed APIs that had been deprecated in the last 
> feature/minor release. But starting with Spark 2.2, we decided to remove 
> deprecated APIs only in a major release, so we need to change the 
> corresponding annotations to tell users those APIs will be removed in 3.0.
> See discussion at https://github.com/apache/spark/pull/17867
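
For illustration, the kind of annotation change described above looks roughly 
like this (a sketch only; the method names are made up, not real Spark APIs):

{code}
// Deprecated API kept through 2.x, with the message updated to point at the
// major release in which it will actually be removed.
@deprecated("Use newMethod instead. This will be removed in Spark 3.0.", "2.1.0")
def oldMethod(x: Double): Double = newMethod(x)

def newMethod(x: Double): Double = x * 2
{code}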



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20501) ML, Graph 2.2 QA: API: New Scala APIs, docs

2017-05-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011622#comment-16011622
 ] 

Yanbo Liang commented on SPARK-20501:
-

All of the classes you mentioned have been reviewed. [~mlnick] [~josephkb]

> ML, Graph 2.2 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-20501
> URL: https://issues.apache.org/jira/browse/SPARK-20501
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20756) yarn-shuffle jar has references to unshaded guava and contains scala classes

2017-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20756:


Assignee: Apache Spark

> yarn-shuffle jar has references to unshaded guava and contains scala classes
> 
>
> Key: SPARK-20756
> URL: https://issues.apache.org/jira/browse/SPARK-20756
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Apache Spark
>
> There are two problems with YARN's shuffle jar currently:
> 1. It contains shaded Guava but still references unshaded Guava classes.
> {code}
> # Guava is correctly relocated
> >jar -tf common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar | grep 
> >guava | head
> META-INF/maven/com.google.guava/
> META-INF/maven/com.google.guava/guava/
> META-INF/maven/com.google.guava/guava/pom.properties
> META-INF/maven/com.google.guava/guava/pom.xml
> org/spark_project/guava/
> org/spark_project/guava/annotations/
> org/spark_project/guava/annotations/Beta.class
> org/spark_project/guava/annotations/GwtCompatible.class
> org/spark_project/guava/annotations/GwtIncompatible.class
> org/spark_project/guava/annotations/VisibleForTesting.class
> # But, there are still references to unshaded guava
> >javap -cp common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar -c 
> >org/apache/spark/network/yarn/YarnShuffleService | grep google
>   57: invokestatic  #139// Method 
> com/google/common/collect/Lists.newArrayList:()Ljava/util/ArrayList;
> {code}
> 2. There are references to scala classes in the uber jar:
> {code}
> jar -tf 
> /opt/src/spark/common/network-yarn/target/scala-2.11/spark-*yarn-shuffle.jar 
> | grep "^scala"
> scala/AnyVal.class
> {code}
> We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20756) yarn-shuffle jar has references to unshaded guava and contains scala classes

2017-05-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20756:


Assignee: (was: Apache Spark)

> yarn-shuffle jar has references to unshaded guava and contains scala classes
> 
>
> Key: SPARK-20756
> URL: https://issues.apache.org/jira/browse/SPARK-20756
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> There are two problems with YARN's shuffle jar currently:
> 1. It contains shaded Guava but still references unshaded Guava classes.
> {code}
> # Guava is correctly relocated
> >jar -tf common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar | grep 
> >guava | head
> META-INF/maven/com.google.guava/
> META-INF/maven/com.google.guava/guava/
> META-INF/maven/com.google.guava/guava/pom.properties
> META-INF/maven/com.google.guava/guava/pom.xml
> org/spark_project/guava/
> org/spark_project/guava/annotations/
> org/spark_project/guava/annotations/Beta.class
> org/spark_project/guava/annotations/GwtCompatible.class
> org/spark_project/guava/annotations/GwtIncompatible.class
> org/spark_project/guava/annotations/VisibleForTesting.class
> # But, there are still references to unshaded guava
> >javap -cp common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar -c 
> >org/apache/spark/network/yarn/YarnShuffleService | grep google
>   57: invokestatic  #139// Method 
> com/google/common/collect/Lists.newArrayList:()Ljava/util/ArrayList;
> {code}
> 2. There are references to scala classes in the uber jar:
> {code}
> jar -tf 
> /opt/src/spark/common/network-yarn/target/scala-2.11/spark-*yarn-shuffle.jar 
> | grep "^scala"
> scala/AnyVal.class
> {code}
> We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20756) yarn-shuffle jar has references to unshaded guava and contains scala classes

2017-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011582#comment-16011582
 ] 

Apache Spark commented on SPARK-20756:
--

User 'markgrover' has created a pull request for this issue:
https://github.com/apache/spark/pull/17990

> yarn-shuffle jar has references to unshaded guava and contains scala classes
> 
>
> Key: SPARK-20756
> URL: https://issues.apache.org/jira/browse/SPARK-20756
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> There are two problems with YARN's shuffle jar currently:
> 1. It contains shaded Guava but still references unshaded Guava classes.
> {code}
> # Guava is correctly relocated
> >jar -tf common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar | grep 
> >guava | head
> META-INF/maven/com.google.guava/
> META-INF/maven/com.google.guava/guava/
> META-INF/maven/com.google.guava/guava/pom.properties
> META-INF/maven/com.google.guava/guava/pom.xml
> org/spark_project/guava/
> org/spark_project/guava/annotations/
> org/spark_project/guava/annotations/Beta.class
> org/spark_project/guava/annotations/GwtCompatible.class
> org/spark_project/guava/annotations/GwtIncompatible.class
> org/spark_project/guava/annotations/VisibleForTesting.class
> # But, there are still references to unshaded guava
> >javap -cp common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar -c 
> >org/apache/spark/network/yarn/YarnShuffleService | grep google
>   57: invokestatic  #139// Method 
> com/google/common/collect/Lists.newArrayList:()Ljava/util/ArrayList;
> {code}
> 2. There are references to scala classes in the uber jar:
> {code}
> jar -tf 
> /opt/src/spark/common/network-yarn/target/scala-2.11/spark-*yarn-shuffle.jar 
> | grep "^scala"
> scala/AnyVal.class
> {code}
> We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20756) yarn-shuffle jar has references to unshaded guava and contains scala classes

2017-05-15 Thread Mark Grover (JIRA)
Mark Grover created SPARK-20756:
---

 Summary: yarn-shuffle jar has references to unshaded guava and 
contains scala classes
 Key: SPARK-20756
 URL: https://issues.apache.org/jira/browse/SPARK-20756
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Mark Grover


There are two problems with YARN's shuffle jar currently:
1. It contains shaded Guava but still references unshaded Guava classes.
{code}
# Guava is correctly relocated
>jar -tf common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar | grep 
>guava | head
META-INF/maven/com.google.guava/
META-INF/maven/com.google.guava/guava/
META-INF/maven/com.google.guava/guava/pom.properties
META-INF/maven/com.google.guava/guava/pom.xml
org/spark_project/guava/
org/spark_project/guava/annotations/
org/spark_project/guava/annotations/Beta.class
org/spark_project/guava/annotations/GwtCompatible.class
org/spark_project/guava/annotations/GwtIncompatible.class
org/spark_project/guava/annotations/VisibleForTesting.class

# But, there are still references to unshaded guava
>javap -cp common/network-yarn/target/scala-2.11/spark*yarn-shuffle.jar -c 
>org/apache/spark/network/yarn/YarnShuffleService | grep google
  57: invokestatic  #139// Method 
com/google/common/collect/Lists.newArrayList:()Ljava/util/ArrayList;
{code}

2. There are references to scala classes in the uber jar:
{code}
jar -tf 
/opt/src/spark/common/network-yarn/target/scala-2.11/spark-*yarn-shuffle.jar | 
grep "^scala"
scala/AnyVal.class
{code}

We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-15 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-20504:
---

You’re right, this is really a headache. Java tools cannot extract some of the 
information `scalac` generates in the jars, such as the package-private 
modifier, the private-class modifier, and so on.
Fetching and comparing these changes through scala-doc would be painful to do 
programmatically, so I hope to avoid that route.
But this information is still preserved in the jar, so I think we can get it 
through the `scala reflection api`:
http://docs.scala-lang.org/overviews/reflection/overview.html
but I need to study it first.
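
For illustration, a minimal sketch of the kind of access information Scala 
runtime reflection can expose (the inspected class below is only an example, 
not the actual audit script):

{code}
import scala.reflect.runtime.{universe => ru}

// Sketch only: list the declared members of a class together with the access
// modifiers that plain Java tools cannot see reliably.
object InspectAccess {
  def describe(className: String): Unit = {
    val mirror = ru.runtimeMirror(getClass.getClassLoader)
    val classSym = mirror.staticClass(className)
    classSym.typeSignature.decls.foreach { member =>
      println(s"${member.fullName}: " +
        s"private=${member.isPrivate}, " +
        s"protected=${member.isProtected}, " +
        s"privateWithin=${member.privateWithin}")  // package-private scope, if any
    }
  }
}

// Example: InspectAccess.describe("org.apache.spark.ml.linalg.Matrices")
{code}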

Thanks!

Sent from Windows Mail

From: Joseph K. Bradley (JIRA)
Sent: Tuesday, May 16, 2017 7:24 AM
To: weichenxu...@outlook.com


[ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011293#comment-16011293
 ]

Joseph K. Bradley edited comment on SPARK-20504 at 5/15/17 11:23 PM:
-

I found one issue: This won't pick up on methods which were package private in 
2.1 and were made public in 2.2.  E.g.: Matrix.foreachActive was made public in 
[SPARK-17471] here: 
https://github.com/apache/spark/pull/15628/files#diff-440e1b707197e577b932a055ab16293eR158
But it does not show up in the diff.  (Update: This foreachActive API will be 
OK to leave as is.)

We won't be able to identify these cases from the JARs; we'll have to rely on 
the docs.



was (Author: josephkb):
I found one issue: This won't pick up on methods which were package private in 
2.1 and were made public in 2.2.  E.g.: Matrix.foreachActive was made public in 
[SPARK-17471] here: 
https://github.com/apache/spark/pull/15628/files#diff-440e1b707197e577b932a055ab16293eR158
But it does not show up in the diff.  (I'll ping on that PR to see what people 
want to do.)

We won't be able to identify these cases from the JARs; we'll have to rely on 
the docs.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.2.0
>
> Attachments: 1_process_script.sh, 2_signature.diff, 
> 3_added_class_signature, 4_common_ml_class, 5_added_ml_class
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so that we can make 
> this task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table

2017-05-15 Thread Weiqing Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011556#comment-16011556
 ] 

Weiqing Yang commented on SPARK-6628:
-

Hi [~srowen], I just submitted a PR for this. Could you please help review 
it? Thanks.

> ClassCastException occurs when executing sql statement "insert into" on hbase 
> table
> ---
>
> Key: SPARK-6628
> URL: https://issues.apache.org/jira/browse/SPARK-6628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: 
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
> org.apache.hadoop.hive.ql.io.HiveOutputFormat
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20700:
---

Assignee: Jiang Xingbo

> InferFiltersFromConstraints stackoverflows for query (v2)
> -
>
> Key: SPARK-20700
> URL: https://issues.apache.org/jira/browse/SPARK-20700
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Assignee: Jiang Xingbo
>
> The following (complicated) query eventually fails with a stack overflow 
> during optimization:
> {code}
> CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, 
> int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES
>   ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', 
> TIMESTAMP('2015-01-14 00:00:00.0'), '947'),
>   ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', 
> TIMESTAMP('1999-08-15 00:00:00.0'), '437'),
>   ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', 
> TIMESTAMP('1991-05-23 00:00:00.0'), '630'),
>   ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), 
> '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'),
>   ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS 
> STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'),
>   ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', 
> CAST(NULL AS TIMESTAMP), '-740'),
>   ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL 
> AS TIMESTAMP), CAST(NULL AS STRING)),
>   ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', 
> CAST(NULL AS TIMESTAMP), '181'),
>   ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', 
> TIMESTAMP('2016-06-30 00:00:00.0'), '487'),
>   ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS 
> STRING), CAST(NULL AS TIMESTAMP), '-62');
> CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null);
> SELECT
> AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS 
> float_col,
> COUNT(t1.smallint_col_2) AS int_col
> FROM table_5 t1
> INNER JOIN (
> SELECT
> (MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * 
> (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) 
> AS boolean_col,
> t2.a,
> (t1.int_col_4) * (t1.int_col_4) AS int_col
> FROM table_5 t1
> LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4)
> WHERE
> (t1.smallint_col_2) > (t1.smallint_col_2)
> GROUP BY
> t2.a,
> (t1.int_col_4) * (t1.int_col_4)
> HAVING
> ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), 
> SUM(t1.int_col_4))
> ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND 
> ((t2.a) = (t1.smallint_col_2));
> {code}
> (I haven't tried to minimize this failing case yet).
> Based on sampled jstacks from the driver, it looks like the query might be 
> repeatedly inferring filters from constraints and then pruning those filters.
> Here's part of the stack at the point where it stackoverflows:
> {code}
> [... repeats ...]
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.immutable.List.flatMap(List.scala:344)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> 

[jira] [Commented] (SPARK-20755) UDF registration should throw exception if UDF not found on classpath

2017-05-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011547#comment-16011547
 ] 

Xiao Li commented on SPARK-20755:
-

Will do it. This is a duplicate of the previous PR. 

> UDF registration should throw exception if UDF not found on classpath
> -
>
> Key: SPARK-20755
> URL: https://issues.apache.org/jira/browse/SPARK-20755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeremy Beard
>Assignee: Xiao Li
>
> UDF registration currently logs an error message if the UDF was not 
> registered because it was not found on the classpath. If it threw an 
> exception instead, the problem would be more obvious in debugging, and/or it 
> could be caught.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table

2017-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011545#comment-16011545
 ] 

Apache Spark commented on SPARK-6628:
-

User 'weiqingy' has created a pull request for this issue:
https://github.com/apache/spark/pull/17989

> ClassCastException occurs when executing sql statement "insert into" on hbase 
> table
> ---
>
> Key: SPARK-6628
> URL: https://issues.apache.org/jira/browse/SPARK-6628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: 
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
> org.apache.hadoop.hive.ql.io.HiveOutputFormat
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20755) UDF registration should throw exception if UDF not found on classpath

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20755:
---

Assignee: Xiao Li

> UDF registration should throw exception if UDF not found on classpath
> -
>
> Key: SPARK-20755
> URL: https://issues.apache.org/jira/browse/SPARK-20755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jeremy Beard
>Assignee: Xiao Li
>
> UDF registration currently logs an error message if the UDF was not 
> registered because it was not found on the classpath. If it threw an 
> exception instead, the problem would be more obvious in debugging, and/or it 
> could be caught.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20755) UDF registration should throw exception if UDF not found on classpath

2017-05-15 Thread Jeremy Beard (JIRA)
Jeremy Beard created SPARK-20755:


 Summary: UDF registration should throw exception if UDF not found 
on classpath
 Key: SPARK-20755
 URL: https://issues.apache.org/jira/browse/SPARK-20755
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Jeremy Beard


UDF registration currently logs an error message if the UDF was not registered 
because it was not found on the classpath. If it threw an exception instead, the 
problem would be more obvious in debugging, and/or it could be caught.
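
For illustration, a rough sketch of the kind of check being suggested (this is 
not Spark's actual registration code; the helper name and exception choice are 
just an example):

{code}
// Hypothetical helper: fail fast instead of only logging when the UDF class
// cannot be loaded from the classpath.
def loadUdfClassOrThrow(className: String, classLoader: ClassLoader): Class[_] = {
  try {
    Class.forName(className, true, classLoader)
  } catch {
    case e: ClassNotFoundException =>
      throw new IllegalArgumentException(
        s"Cannot register UDF: class $className was not found on the classpath", e)
  }
}
{code}

Callers would then see the failure at registration time instead of discovering a 
missing function later.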



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20588) from_utc_timestamp causes bottleneck

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20588.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> from_utc_timestamp causes bottleneck
> 
>
> Key: SPARK-20588
> URL: https://issues.apache.org/jira/browse/SPARK-20588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR AMI 5.2.1
>Reporter: Ameen Tayyebi
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>
> We have a SQL query that makes use of the from_utc_timestamp function like 
> so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles')
> This causes a major bottleneck. Our exact call is:
> date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1)
> Switching from the above to date_add(itemSigningTime, 1) reduces the job 
> running time from 40 minutes to 9.
> When from_utc_timestamp function is used, several threads in the executors 
> are in the BLOCKED state, on this call stack:
> "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
> tid=0x7f848472e000 nid=0x4294 waiting for monitor entry 
> [0x7f501981c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.TimeZone.getTimeZone(TimeZone.java:516)
> - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
> java.util.TimeZone)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Can we cache the locales once per JVM so that we don't do this for every 
> record?
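
For illustration, a minimal per-JVM cache along the lines suggested above (a 
sketch only, not the actual change; cached TimeZone instances must not be 
mutated by callers):

{code}
import java.util.TimeZone
import java.util.concurrent.ConcurrentHashMap

// Sketch: look each zone ID up once and reuse it, instead of hitting the
// synchronized TimeZone.getTimeZone for every record.
object TimeZoneCache {
  private val cache = new ConcurrentHashMap[String, TimeZone]()

  def get(timeZoneId: String): TimeZone = {
    val cached = cache.get(timeZoneId)
    if (cached != null) {
      cached
    } else {
      val tz = TimeZone.getTimeZone(timeZoneId)  // the contended call
      cache.putIfAbsent(timeZoneId, tz)
      tz
    }
  }
}

// e.g. TimeZoneCache.get("America/Los_Angeles") inside the hot evaluation path
{code}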



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20588) from_utc_timestamp causes bottleneck

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20588:
---

Assignee: Takuya Ueshin

> from_utc_timestamp causes bottleneck
> 
>
> Key: SPARK-20588
> URL: https://issues.apache.org/jira/browse/SPARK-20588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR AMI 5.2.1
>Reporter: Ameen Tayyebi
>Assignee: Takuya Ueshin
> Fix For: 2.2.0
>
>
> We have a SQL query that makes use of the from_utc_timestamp function like 
> so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles')
> This causes a major bottleneck. Our exact call is:
> date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1)
> Switching from the above to date_add(itemSigningTime, 1) reduces the job 
> running time from 40 minutes to 9.
> When from_utc_timestamp function is used, several threads in the executors 
> are in the BLOCKED state, on this call stack:
> "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
> tid=0x7f848472e000 nid=0x4294 waiting for monitor entry 
> [0x7f501981c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.TimeZone.getTimeZone(TimeZone.java:516)
> - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
> java.util.TimeZone)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Can we cache the locales once per JVM so that we don't do this for every 
> record?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2017-05-15 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011504#comment-16011504
 ] 

Maciej Szymkiewicz commented on SPARK-18825:


It is a bit of a hack, but I made some experiments and patched Knitr to:

- Remove {{-method}} entries from {{00index.html}}
- Strip {{dontrun}} comments so we can run {{examples}} as part of the docs build.

Right now it will fail on the build (not all examples are runnable), but it could 
be a path worth exploring. Here is the branch: 
https://github.com/zero323/knitr/tree/SPARK-DOCS 

> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index . E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generate the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011293#comment-16011293
 ] 

Joseph K. Bradley edited comment on SPARK-20504 at 5/15/17 11:23 PM:
-

I found one issue: This won't pick up on methods which were package private in 
2.1 and were made public in 2.2.  E.g.: Matrix.foreachActive was made public in 
[SPARK-17471] here: 
https://github.com/apache/spark/pull/15628/files#diff-440e1b707197e577b932a055ab16293eR158
But it does not show up in the diff.  (Update: This foreachActive API will be 
OK to leave as is.)

We won't be able to identify these cases from the JARs; we'll have to rely on 
the docs.



was (Author: josephkb):
I found one issue: This won't pick up on methods which were package private in 
2.1 and were made public in 2.2.  E.g.: Matrix.foreachActive was made public in 
[SPARK-17471] here: 
https://github.com/apache/spark/pull/15628/files#diff-440e1b707197e577b932a055ab16293eR158
But it does not show up in the diff.  (I'll ping on that PR to see what people 
want to do.)

We won't be able to identify these cases from the JARs; we'll have to rely on 
the docs.

> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.2.0
>
> Attachments: 1_process_script.sh, 2_signature.diff, 
> 3_added_class_signature, 4_common_ml_class, 5_added_ml_class
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so that we can make 
> this task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table

2017-05-15 Thread Weiqing Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011490#comment-16011490
 ] 

Weiqing Yang commented on SPARK-6628:
-

We ran into this issue too.

The major issue is:
{code}
org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat 
{code}
cannot be cast to
{code}
 org.apache.hadoop.hive.ql.io.HiveOutputFormat
{code}
The reason is:
{code}
public interface HiveOutputFormat extends OutputFormat {…}

public class HiveHBaseTableOutputFormat extends
TableOutputFormat implements
OutputFormat {...}
{code}

From the two snippets above, we can see that both HiveHBaseTableOutputFormat and 
HiveOutputFormat extend/implement OutputFormat, and cannot be cast to each 
other. 

Spark initializes the output format in SparkHiveWriterContainer in Spark 1.6, 
2.0, and 2.1 (or in HiveFileFormat in Spark 2.2/master):
{code}
@transient private lazy val outputFormat =
jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, 
Writable]]
{code}
Notice: this file output format is {color:red}HiveOutputFormat{color}.
However, when users write data into HBase, the outputFormat is 
HiveHBaseTableOutputFormat, which isn't an instance of HiveOutputFormat.

I am going to submit a PR for this.
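
For illustration, a rough sketch of the direction such a fix could take (a 
hypothetical helper, not the actual PR):

{code}
import org.apache.hadoop.hive.ql.io.HiveOutputFormat
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapred.{JobConf, OutputFormat}

// Sketch: resolve the configured output format and only downcast when it
// really is a HiveOutputFormat, instead of an unconditional asInstanceOf that
// throws ClassCastException for HiveHBaseTableOutputFormat.
def resolveOutputFormat(
    jobConf: JobConf): Either[OutputFormat[_, _], HiveOutputFormat[AnyRef, Writable]] = {
  jobConf.getOutputFormat match {
    case hive: HiveOutputFormat[_, _] =>
      Right(hive.asInstanceOf[HiveOutputFormat[AnyRef, Writable]])
    case other: OutputFormat[_, _] =>
      Left(other)  // e.g. HiveHBaseTableOutputFormat: use the plain mapred path
  }
}
{code}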


> ClassCastException occurs when executing sql statement "insert into" on hbase 
> table
> ---
>
> Key: SPARK-6628
> URL: https://issues.apache.org/jira/browse/SPARK-6628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: 
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
> org.apache.hadoop.hive.ql.io.HiveOutputFormat
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20754) Add Function Alias For MOD/TRUNCT/POSITION

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20754:

Labels: starter  (was: )

> Add Function Alias For MOD/TRUNCT/POSITION
> --
>
> Key: SPARK-20754
> URL: https://issues.apache.org/jira/browse/SPARK-20754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> We already have the implementation of the following functions. We can add the 
> function aliases to be consistent with ANSI. 
> {noformat} 
> MOD(m, n)
> {noformat} 
> Returns the remainder of m divided by n. Returns m if n is 0.
> {noformat} 
> TRUNC(x, D)
> {noformat} 
> Returns the number x, truncated to D decimals. If D is 0, the result will 
> have no decimal point or fractional part. If D is negative, the number is 
> zeroed out.
> {noformat} 
> POSITION(char IN source)
> {noformat} 
> Returns the position of char in the source string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20754) Add Function Alias For MOD/TRUNCT/POSITION

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20754:

Parent Issue: SPARK-20746  (was: SPARK-2076)

> Add Function Alias For MOD/TRUNCT/POSITION
> --
>
> Key: SPARK-20754
> URL: https://issues.apache.org/jira/browse/SPARK-20754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> We already have the implementation of the following functions. We can add the 
> function aliases to be consistent with ANSI. 
> {noformat} 
> MOD(m, n)
> {noformat} 
> Returns the remainder of m divided by n. Returns m if n is 0.
> {noformat} 
> TRUNC(x, D)
> {noformat} 
> Returns the number x, truncated to D decimals. If D is 0, the result will 
> have no decimal point or fractional part. If D is negative, the number is 
> zeroed out.
> {noformat} 
> POSITION(char IN source)
> {noformat} 
> Returns the position of char in the source string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20754) Add Function Alias For MOD/TRUNCT/POSITION

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20754:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-2076

> Add Function Alias For MOD/TRUNCT/POSITION
> --
>
> Key: SPARK-20754
> URL: https://issues.apache.org/jira/browse/SPARK-20754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> We already have the implementation of the following functions. We can add the 
> function aliases to be consistent with ANSI. 
> {noformat} 
> MOD(m, n)
> {noformat} 
> Returns the remainder of m divided by n. Returns m if n is 0.
> {noformat} 
> TRUNC(x, D)
> {noformat} 
> Returns the number x, truncated to D decimals. If D is 0, the result will 
> have no decimal point or fractional part. If D is negative, the number is 
> zeroed out.
> {noformat} 
> POSITION(char IN source)
> {noformat} 
> Returns the position of char in the source string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19238) Ignore sorting the edges if edges are sorted when building edge partition

2017-05-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19238:
--
Fix Version/s: (was: 2.2.0)

> Ignore sorting the edges if edges are sorted when building edge partition
> -
>
> Key: SPARK-19238
> URL: https://issues.apache.org/jira/browse/SPARK-19238
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.1.0
>Reporter: Liu Shaohui
>Priority: Minor
>
> Usually the graph edges generated by an upstream application and saved by 
> other graphs are already sorted, so the sorting is not necessary.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20754) Add Function Alias For MOD/TRUNCT/POSITION

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20754:

Summary: Add Function Alias For MOD/TRUNCT/POSITION  (was: Add Function 
Alias For )

> Add Function Alias For MOD/TRUNCT/POSITION
> --
>
> Key: SPARK-20754
> URL: https://issues.apache.org/jira/browse/SPARK-20754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> We already have the implementation of the following functions. We can add the 
> function aliases to be consistent with ANSI. 
> {noformat} 
> MOD(m, n)
> {noformat} 
> Returns the remainder of m divided by n. Returns m if n is 0.
> {noformat} 
> TRUNC(x, D)
> {noformat} 
> Returns the number x, truncated to D decimals. If D is 0, the result will 
> have no decimal point or fractional part. If D is negative, the number is 
> zeroed out.
> {noformat} 
> POSITION(char IN source)
> {noformat} 
> Returns the position of char in the source string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20754) Add Function Alias For

2017-05-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20754:
---

 Summary: Add Function Alias For 
 Key: SPARK-20754
 URL: https://issues.apache.org/jira/browse/SPARK-20754
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


We already have the implementation of the following functions. We can add the 
function aliases to be consistent with ANSI. 

{noformat} 
MOD(m, n)
{noformat} 
Returns the remainder of m divided by n. Returns m if n is 0.

{noformat} 
TRUNC(x, D)
{noformat} 
Returns the number x, truncated to D decimals. If D is 0, the result will have 
no decimal point or fractional part. If D is negative, the number is zeroed out.

{noformat} 
POSITION(char IN source)
{noformat} 
Returns the position of char in the source string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-20753) Build-in SQL Function Support - LOCALTIMESTAMP

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li deleted SPARK-20753:



> Build-in SQL Function Support - LOCALTIMESTAMP
> --
>
> Key: SPARK-20753
> URL: https://issues.apache.org/jira/browse/SPARK-20753
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Xiao Li
>
> {{LOCALTIMESTAMP}}
> - The current date and time in the server’s time zone, returned as a value of 
> the TIMESTAMP data type.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20752) Build-in SQL Function Support - SQRT

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20752:

Summary: Build-in SQL Function Support - SQRT  (was: Build-in SQL Function 
Support - SQUARE)

> Build-in SQL Function Support - SQRT
> 
>
> Key: SPARK-20752
> URL: https://issues.apache.org/jira/browse/SPARK-20752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> {noformat}
> SQRT(<num>)
> {noformat}
> Returns Power(<num>, 0.5)
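As a quick sanity check of the intended semantics (SQRT and POWER already exist in Spark SQL; the equivalence with POWER shown in the comment is the assumption here):

{code}
// SQRT(x) should match POWER(x, 0.5); SQUARE(x), if added, would match POWER(x, 2).
spark.sql("SELECT SQRT(16.0), POWER(16.0, 0.5), POWER(4.0, 2)").show()
// expected values: 4.0, 4.0, 16.0
{code}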



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20752) Build-in SQL Function Support - SQUARE

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20752:

Description: 
{noformat}
SQRT(<num>)
{noformat}
Returns Power(<num>, 0.5)


  was:
{noformat}
SQUARE(<num>)
{noformat}
Returns Power(<num>, 2)



> Build-in SQL Function Support - SQUARE
> --
>
> Key: SPARK-20752
> URL: https://issues.apache.org/jira/browse/SPARK-20752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> {noformat}
> SQRT(<num>)
> {noformat}
> Returns Power(<num>, 0.5)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20752) Build-in SQL Function Support - SQUARE

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20752:

Summary: Build-in SQL Function Support - SQUARE  (was: Build-in Function 
Support - SQUARE)

> Build-in SQL Function Support - SQUARE
> --
>
> Key: SPARK-20752
> URL: https://issues.apache.org/jira/browse/SPARK-20752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> {noformat}
> SQUARE(<num>)
> {noformat}
> Returns Power(<num>, 2)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20753) Build-in SQL Function Support - LOCALTIMESTAMP

2017-05-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20753:
---

 Summary: Build-in SQL Function Support - LOCALTIMESTAMP
 Key: SPARK-20753
 URL: https://issues.apache.org/jira/browse/SPARK-20753
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


{{LOCALTIMESTAMP}}

- The current date and time in the server’s time zone, returned as a value of 
the TIMESTAMP data type.
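A minimal sketch of how this would be exercised once implemented; LOCALTIMESTAMP is the proposed function, so the second call is hypothetical, and the existing CURRENT_TIMESTAMP is shown only for comparison:

{code}
// Existing function, for comparison.
spark.sql("SELECT CURRENT_TIMESTAMP").show(truncate = false)
// Proposed ANSI form (hypothetical until supported).
spark.sql("SELECT LOCALTIMESTAMP").show(truncate = false)
{code}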



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20752) Build-in Function Support - SQUARE

2017-05-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20752:
---

 Summary: Build-in Function Support - SQUARE
 Key: SPARK-20752
 URL: https://issues.apache.org/jira/browse/SPARK-20752
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


{noformat}
SQUARE(<num>)
{noformat}
Returns Power(<num>, 2)




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20751) Built-in SQL Function Support - COT

2017-05-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20751:
---

 Summary: Built-in SQL Function Support - COT
 Key: SPARK-20751
 URL: https://issues.apache.org/jira/browse/SPARK-20751
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


{noformat}
COT(<num>)
{noformat}
Returns the cotangent of <num>.
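A quick sketch of the expected behaviour; COT is the function being proposed, so the check below uses the existing TAN as the assumed reference:

{code}
// COT(x) should equal 1 / TAN(x); verified here against the existing TAN function.
spark.sql("SELECT 1.0 / TAN(1.0)").show()   // expected: ~0.642
// Once COT is available, spark.sql("SELECT COT(1.0)") should return the same value.
{code}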



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20750) Built-in SQL Function Support - REPLACE

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20750:

Labels: starter  (was: )

> Built-in SQL Function Support - REPLACE
> ---
>
> Key: SPARK-20750
> URL: https://issues.apache.org/jira/browse/SPARK-20750
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> {noformat}
> REPLACE(<string>, <search> [, <replacement>])
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20750) Built-in SQL Function Support - REPLACE

2017-05-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20750:
---

 Summary: Built-in SQL Function Support - REPLACE
 Key: SPARK-20750
 URL: https://issues.apache.org/jira/browse/SPARK-20750
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


{noformat}
REPLACE(<string>, <search> [, <replacement>])
{noformat}
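For illustration, the behaviour this proposal implies; REPLACE is not assumed to exist yet, and the values below are examples only:

{code}
// Hypothetical usage of the proposed REPLACE(<string>, <search> [, <replacement>]).
spark.sql("SELECT REPLACE('SparkSQL', 'SQL', 'Core')").show()  // expected: SparkCore
spark.sql("SELECT REPLACE('SparkSQL', 'SQL')").show()          // expected: Spark (match removed)
{code}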



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20748) Built-in SQL Function Support - CH[A]R

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20748:

Summary: Built-in SQL Function Support - CH[A]R  (was: Support CH[A]R)

> Built-in SQL Function Support - CH[A]R
> --
>
> Key: SPARK-20748
> URL: https://issues.apache.org/jira/browse/SPARK-20748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> {noformat}
> CH[A]R(<n>)
> {noformat}
> Returns a character when given its ASCII code.
> Ref: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions019.htm
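A minimal illustration of the intended behaviour; CHAR/CHR is the function being proposed here, and the ASCII example is an assumption based on the Oracle reference above:

{code}
// Hypothetical once CH[A]R is supported: returns the character for an ASCII code.
spark.sql("SELECT CHAR(65), CHR(97)").show()   // expected: 'A' and 'a'
{code}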



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20749) Built-in SQL Function Support - all variants of LEN[GTH]

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20749:

Labels: starter  (was: )

> Built-in SQL Function Support - all variants of LEN[GTH]
> 
>
> Key: SPARK-20749
> URL: https://issues.apache.org/jira/browse/SPARK-20749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> {noformat}
> LEN[GTH](<string>)
> {noformat}
> The SQL 99 standard includes BIT_LENGTH(), CHAR_LENGTH(), and OCTET_LENGTH() 
> functions.
> We need to support all of them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20749) Built-in SQL Function Support - all variants of LEN[GTH]

2017-05-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20749:
---

 Summary: Built-in SQL Function Support - all variants of LEN[GTH]
 Key: SPARK-20749
 URL: https://issues.apache.org/jira/browse/SPARK-20749
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


{noformat}
LEN[GTH](<string>)
{noformat}

The SQL 99 standard includes BIT_LENGTH(), CHAR_LENGTH(), and OCTET_LENGTH() 
functions.

We need to support all of them.
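For illustration, the expected relationship between the three variants on a plain ASCII string; these functions are the ones being proposed, so the calls below are hypothetical:

{code}
// For ASCII input, CHAR_LENGTH = OCTET_LENGTH and BIT_LENGTH = 8 * OCTET_LENGTH.
spark.sql("SELECT CHAR_LENGTH('Spark'), OCTET_LENGTH('Spark'), BIT_LENGTH('Spark')").show()
// expected: 5, 5, 40
{code}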



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20748) Support CH[A]R

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20748:

Description: 
{noformat}
CH[A]R(<n>)
{noformat}
Returns a character when given its ASCII code.

Ref: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions019.htm

  was:
{noformat}
CH[A]R(<n>)
{noformat}
Returns a character when given its ASCII code.


> Support CH[A]R
> --
>
> Key: SPARK-20748
> URL: https://issues.apache.org/jira/browse/SPARK-20748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> {noformat}
> CH[A]R(<n>)
> {noformat}
> Returns a character when given its ASCII code.
> Ref: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions019.htm



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20748) Support CH[A]R

2017-05-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20748:
---

 Summary: Support CH[A]R
 Key: SPARK-20748
 URL: https://issues.apache.org/jira/browse/SPARK-20748
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


{noformat}
CH[A]R(<n>)
{noformat}
Returns a character when given its ASCII code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20748) Support CH[A]R

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20748:

Labels: starter  (was: )

> Support CH[A]R
> --
>
> Key: SPARK-20748
> URL: https://issues.apache.org/jira/browse/SPARK-20748
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> {noformat}
> CH[A]R(<n>)
> {noformat}
> Returns a character when given its ASCII code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20747) Distinct in Aggregate Functions

2017-05-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20747:
---

 Summary: Distinct in Aggregate Functions
 Key: SPARK-20747
 URL: https://issues.apache.org/jira/browse/SPARK-20747
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


{noformat}
AVG ([DISTINCT]|[ALL] <expression>)
MAX ([DISTINCT]|[ALL] <expression>)
MIN ([DISTINCT]|[ALL] <expression>)
SUM ([DISTINCT]|[ALL] <expression>)
{noformat}
Except for COUNT, the DISTINCT clause is not supported by Spark SQL.
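A small sketch of what accepting DISTINCT in these aggregates would look like; the data is illustrative and the point is only the syntax:

{code}
// Illustrative: DISTINCT inside aggregates other than COUNT, as proposed above.
spark.range(10).selectExpr("id % 3 AS v").createOrReplaceTempView("t")
spark.sql("SELECT COUNT(DISTINCT v), SUM(DISTINCT v), AVG(DISTINCT v) FROM t").show()
// distinct values are 0, 1, 2 -> expected: 3, 3, 1.0
{code}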



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20746) Built-in SQL Function Improvement

2017-05-15 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20746:
---

 Summary: Built-in SQL Function Improvement
 Key: SPARK-20746
 URL: https://issues.apache.org/jira/browse/SPARK-20746
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


SQL functions are part of the core of the ISO/ANSI standards. This umbrella 
JIRA tracks all the ISO/ANSI SQL functions that are not fully implemented by 
Spark SQL, and the documentation and test-case issues in the functions that 
are already supported.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20504) ML 2.2 QA: API: Java compatibility, docs

2017-05-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011293#comment-16011293
 ] 

Joseph K. Bradley commented on SPARK-20504:
---

I found one issue: This won't pick up on methods which were package private in 
2.1 and were made public in 2.2.  E.g.: Matrix.foreachActive was made public in 
[SPARK-17471] here: 
https://github.com/apache/spark/pull/15628/files#diff-440e1b707197e577b932a055ab16293eR158
But it does not show up in the diff.  (I'll ping on that PR to see what people 
want to do.)

We won't be able to identify these cases from the JARs; we'll have to rely on 
the docs.

> ML 2.2 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-20504
> URL: https://issues.apache.org/jira/browse/SPARK-20504
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.2.0
>
> Attachments: 1_process_script.sh, 2_signature.diff, 
> 3_added_class_signature, 4_common_ml_class, 5_added_ml_class
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so that we can make 
> this task easier in the future!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20735) Enable cross join in TPCDSQueryBenchmark

2017-05-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20735.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.2.0
   2.1.2

> Enable cross join in TPCDSQueryBenchmark
> 
>
> Key: SPARK-20735
> URL: https://issues.apache.org/jira/browse/SPARK-20735
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
>
> Since SPARK-17298, some queries (q28, q61, q77, q88, q90) fail with the 
> message "Use the CROSS JOIN syntax to allow cartesian products between these 
> relations".
> This issue aims to enable the correct configuration in 
> `TPCDSQueryBenchmark.scala`.
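For context, a minimal sketch of the setting involved; the configuration key below (spark.sql.crossJoin.enabled) is my assumption about what the benchmark needs to enable:

{code}
// Allow cartesian products so the affected TPC-DS queries can run in the benchmark.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
{code}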



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage

2017-05-15 Thread Amit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16011024#comment-16011024
 ] 

Amit Kumar commented on SPARK-20589:


I can give some more context. This originally arose from a use case where we 
work with images (binary data) in a PairRDD of (url, imageData). The pipeline 
consists mostly of map tasks on the PairRDD, with the final step uploading the 
results to a storage service. 
The problem is that the RDD can be huge, so it is expensive to persist it 
before the coalesce; on the other hand, without persisting, the reduced 
parallelism starts affecting the earlier stages.
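A rough sketch of the kind of pipeline described; all names here, including uploadToStorage and the sample URL, are hypothetical stand-ins, and the sketch only illustrates why the coalesce propagates upstream when nothing is persisted:

{code}
import org.apache.spark.rdd.RDD

// Stand-in for the real storage client.
def uploadToStorage(url: String, data: Array[Byte]): Unit = { /* PUT to the storage service */ }

val images: RDD[(String, Array[Byte])] =
  sc.parallelize(Seq(("http://example.com/a.jpg", Array[Byte](1, 2, 3))), numSlices = 100)

images
  .mapValues(identity)                                          // image transformations would go here
  .coalesce(8)                                                  // caps upload parallelism, but without a persist
  .foreach { case (url, data) => uploadToStorage(url, data) }   // ...it also caps the map stage above
{code}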

> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your Spark job accesses another service and 
> you don't want to DoS that service, for instance Spark writing to HBase or 
> doing HTTP PUTs against a service.  Often you want to do this without 
> limiting the number of partitions. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError

2017-05-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20666:
-
Fix Version/s: (was: 2.2.1)
   (was: 2.3.0)
   2.2.0

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError
> ---
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.0
>
>
> Seeing quite a bit of this on AppVeyor (i.e. Windows only); it appears in 
> other test runs too, but always only when running ML tests.
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 

[jira] [Updated] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError

2017-05-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20666:
-
Affects Version/s: (was: 2.3.0)
   2.2.0

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError
> ---
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.0
>
>
> Seeing quite a bit of this on AppVeyor (i.e. Windows only); it appears in 
> other test runs too, but always only when running ML tests.
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2907)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError

2017-05-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20666:
-
Comment: was deleted

(was: User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/17966)

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError
> ---
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>
> Seeing quite a bit of this on AppVeyor (i.e. Windows only); it appears in 
> other test runs too, but always only when running ML tests.
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 

[jira] [Resolved] (SPARK-20716) StateStore.abort() should not throw further exception

2017-05-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20716.
--
Resolution: Fixed

> StateStore.abort() should not throw further exception
> -
>
> Key: SPARK-20716
> URL: https://issues.apache.org/jira/browse/SPARK-20716
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> StateStore.abort() should do a best effort attempt to clean up temporary 
> resources. It should not throw errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17729) Enable creating hive bucketed tables

2017-05-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-17729:
---

Assignee: Tejas Patil

> Enable creating hive bucketed tables
> 
>
> Key: SPARK-17729
> URL: https://issues.apache.org/jira/browse/SPARK-17729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Hive allows inserting data into a bucketed table without guaranteeing 
> bucketed-ness and sorted-ness, based on these two configs: 
> `hive.enforce.bucketing` and `hive.enforce.sorting`. 
> With this JIRA, Spark still won't produce bucketed data as per Hive's 
> bucketing guarantees, but will allow writes iff the user wishes to do so 
> without caring about those guarantees. The ability to create bucketed tables 
> will enable adding test cases to Spark while the pieces that let Spark 
> support Hive bucketing are being added (e.g. https://github.com/apache/spark/pull/15229)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20717) Tweak MapGroupsWithState update function behavior

2017-05-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20717.
--
Resolution: Fixed

> Tweak MapGroupsWithState update function behavior
> -
>
> Key: SPARK-20717
> URL: https://issues.apache.org/jira/browse/SPARK-20717
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Timeout and state data are two independent entities and should be settable 
> independently. Therefore, in the same call of the user-defined function, one 
> should be able to set the timeout before initializing the state and also 
> after removing the state. Whether timeouts can be set or not should not 
> depend on the current state, and vice versa. 
> However, a limitation of the current implementation is that state cannot be 
> null while timeout is set. This is checked lazily after the function call has 
> completed.
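For reference, a minimal sketch of the shape of user function being discussed; the state class, key and value types, and timeout value are illustrative assumptions, not code from the change itself:

{code}
import org.apache.spark.sql.streaming.GroupState

case class RunningCount(count: Long)

// Illustrative update function for mapGroupsWithState: state and timeout are
// manipulated independently within the same call, as described above.
def updateFn(key: String, values: Iterator[String], state: GroupState[RunningCount]): String = {
  if (state.hasTimedOut) {
    state.remove()
    s"$key: timed out"
  } else {
    val updated = RunningCount(state.getOption.map(_.count).getOrElse(0L) + values.size)
    state.update(updated)
    state.setTimeoutDuration("10 seconds")   // set after the state update in this sketch
    s"$key: ${updated.count}"
  }
}
// Typically used as: ds.groupByKey(...).mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateFn)
{code}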



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17729) Enable creating hive bucketed tables

2017-05-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17729.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17644
[https://github.com/apache/spark/pull/17644]

> Enable creating hive bucketed tables
> 
>
> Key: SPARK-17729
> URL: https://issues.apache.org/jira/browse/SPARK-17729
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Hive allows inserting data into a bucketed table without guaranteeing 
> bucketed-ness and sorted-ness, based on these two configs: 
> `hive.enforce.bucketing` and `hive.enforce.sorting`. 
> With this JIRA, Spark still won't produce bucketed data as per Hive's 
> bucketing guarantees, but will allow writes iff the user wishes to do so 
> without caring about those guarantees. The ability to create bucketed tables 
> will enable adding test cases to Spark while the pieces that let Spark 
> support Hive bucketing are being added (e.g. https://github.com/apache/spark/pull/15229)
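A sketch of the kind of DDL this change enables; the table name, columns, storage format, and bucket count below are illustrative only:

{code}
// Creating a Hive-bucketed table from Spark (illustrative DDL; the bucketing
// semantics of data written by Spark are still not guaranteed, as noted above).
spark.sql("""
  CREATE TABLE bucketed_src (id INT, name STRING)
  CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS
  STORED AS ORC
""")
{code}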



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010967#comment-16010967
 ] 

Reynold Xin commented on SPARK-12297:
-

Can you clarify what you mean by Spark SQL allowing timestamps without time 
zones for other formats (e.g. CSV)?

BTW, if we really need this, I'd do a logical rewrite to inject the time zone 
conversion arithmetic, rather than hacking all the random places in physical 
execution. 

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with time zones, while in other 
> file formats, timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  E.g., here I write out a few timestamps to parquet and 
> textfile hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and json based data show the same times, and can be joined 
> against each other, while the times from the parquet data have changed (and 
> obviously joins fail).
> This is a big problem for any organization that may try to read the same data 
> (say in S3) with clusters in multiple timezones.  It can also be a nasty 
> surprise as an organization tries to migrate file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property which indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11034) Launcher: add support for monitoring Mesos apps

2017-05-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010930#comment-16010930
 ] 

Marcelo Vanzin commented on SPARK-11034:


It's up to someone interested in supporting this on Mesos to write the code.

> Launcher: add support for monitoring Mesos apps
> ---
>
> Key: SPARK-11034
> URL: https://issues.apache.org/jira/browse/SPARK-11034
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>
> The code to monitor apps launched using the launcher library was added in 
> SPARK-8673, but the backend does not support monitoring apps launched through 
> Mesos yet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11034) Launcher: add support for monitoring Mesos apps

2017-05-15 Thread Avinash Venkateshaiah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010927#comment-16010927
 ] 

Avinash Venkateshaiah commented on SPARK-11034:
---

Is there any plan to support this?

> Launcher: add support for monitoring Mesos apps
> ---
>
> Key: SPARK-11034
> URL: https://issues.apache.org/jira/browse/SPARK-11034
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>
> The code to monitor apps launched using the launcher library was added in 
> SPARK-8673, but the backend does not support monitoring apps launched through 
> Mesos yet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20501) ML, Graph 2.2 QA: API: New Scala APIs, docs

2017-05-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010914#comment-16010914
 ] 

Joseph K. Bradley commented on SPARK-20501:
---

Also:
* {{LinearSVC}}
* {{AssociationRules}}
* {{ChiSquareTest}}
* {{Correlation}}

> ML, Graph 2.2 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-20501
> URL: https://issues.apache.org/jira/browse/SPARK-20501
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20666) Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError

2017-05-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20666.

   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.3.0
   2.2.1

> Flaky test - SparkListenerBus randomly failing java.lang.IllegalAccessError
> ---
>
> Key: SPARK-20666
> URL: https://issues.apache.org/jira/browse/SPARK-20666
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core, SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>
> Seeing quite a bit of this on AppVeyor (i.e. Windows only); it appears in 
> other test runs too, but always only when running ML tests.
> {code}
> Exception in thread "SparkListenerBus" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 159454
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)
>   at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:88)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetric.toInfo(SQLMetrics.scala:67)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener$$anonfun$onTaskEnd$1.apply(SQLListener.scala:216)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:216)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:36)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:94)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78)
>   at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1268)
>   at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77)
> 1
> MLlib recommendation algorithms: Spark package found in SPARK_HOME: 
> C:\projects\spark\bin\..
> {code}
> {code}
> java.lang.IllegalStateException: SparkContext has been shutdown
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2015)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2044)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2063)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2088)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2923)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2474)
>   at 

[jira] [Resolved] (SPARK-20742) SparkAppHandle.getState() doesnt return the right state when the launch is done on a mesos master in cluster mode

2017-05-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20742.

Resolution: Duplicate

> SparkAppHandle.getState() doesnt return the right state when the launch is 
> done on a mesos master in cluster mode
> -
>
> Key: SPARK-20742
> URL: https://issues.apache.org/jira/browse/SPARK-20742
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2, 2.1.0, 2.1.1
>Reporter: Avinash Venkateshaiah
>Priority: Critical
>
> I am launching a Spark application on a Mesos master in cluster mode using 
> the SparkLauncher. This returns a handle (SparkAppHandle). However, when I 
> try to check the state using handle.getState(), it is always UNKNOWN even 
> though the job was submitted and completed successfully.
> I'm using Spark 2.1.0.
> Thanks,
> Avinash
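A minimal sketch of the reported scenario; the app resource, main class, and master URL below are placeholders:

{code}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Launch in Mesos cluster mode and poll the handle state.
val handle: SparkAppHandle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")
  .setMainClass("com.example.Main")
  .setMaster("mesos://mesos-master:7077")
  .setDeployMode("cluster")
  .startApplication()

// Reportedly stays UNKNOWN on Mesos instead of progressing to RUNNING/FINISHED.
println(handle.getState)
{code}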



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows

2017-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010748#comment-16010748
 ] 

Apache Spark commented on SPARK-18922:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17987

> Fix more resource-closing-related and path-related test failures in 
> identified ones on Windows
> --
>
> Key: SPARK-18922
> URL: https://issues.apache.org/jira/browse/SPARK-18922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> There are more instances that are failed on Windows as below:
> - {{LauncherBackendSuite}}:
> {code}
> - local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds)
>   The code passed to eventually never returned normally. Attempted 283 times 
> over 30.0960053 seconds. Last failure message: The reference was null. 
> (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> - standalone/client: launcher handle *** FAILED *** (30 seconds, 47 
> milliseconds)
>   The code passed to eventually never returned normally. Attempted 282 times 
> over 30.03798710002 seconds. Last failure message: The reference was 
> null. (LauncherBackendSuite.scala:56)
>   org.scalatest.exceptions.TestFailedDueToTimeoutException:
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
> {code}
> - {{SQLQuerySuite}}:
> {code}
> - specifying database name for a temporary table is not allowed *** FAILED 
> *** (125 milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{JsonSuite}}:
> {code}
> - Loading a JSON dataset from a text file with SQL *** FAILED *** (94 
> milliseconds)
>   org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/C:projectsspark  arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {code}
> - {{StateStoreSuite}}:
> {code}
> - SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds)
>   java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.(Path.java:116)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   ...
>   Cause: java.net.URISyntaxException: Relative path in absolute URI: 
> StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:203)
> {code}
> - {{HDFSMetadataLogSuite}}:
> {code}
> - FileManager: FileContextManager *** FAILED *** (94 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> - FileManager: FileSystemManager *** FAILED *** (78 milliseconds)
>   java.io.IOException: Failed to delete: 
> C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38)
> {code}
> For the full logs, please refer to: 
> 

[jira] [Commented] (SPARK-19707) Improve the invalid path check for sc.addJar

2017-05-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010747#comment-16010747
 ] 

Apache Spark commented on SPARK-19707:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17987

> Improve the invalid path check for sc.addJar
> 
>
> Key: SPARK-19707
> URL: https://issues.apache.org/jira/browse/SPARK-19707
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.1.1, 2.2.0
>
>
> Currently in Spark there are two issues when we add jars with an invalid path:
> * If the jar path is an empty string ({{--jar ",dummy.jar"}}), Spark resolves it 
> to the current directory path and adds it to the classpath / file server, which 
> is unwanted.
> * If the jar path is invalid (the file doesn't exist), the file server doesn't 
> check this and still adds the file; the exception is only thrown once the job is 
> running. This local path could be checked immediately, with no need to wait until 
> the task runs. We have a similar check in {{addFile}}, but lack one in {{addJar}} 
> (an eager check is sketched below).
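For illustration, an eager local-path check along the lines suggested above might look roughly like the following. This is a minimal sketch under assumed semantics, not the change in the linked pull request; {{validateLocalJarPath}} is a hypothetical helper.

{code:lang=scala}
import java.io.File
import java.net.URI

object AddJarPathCheckSketch {
  // Hypothetical helper: reject empty or non-existent local jar paths eagerly,
  // instead of letting the failure surface only when tasks run.
  def validateLocalJarPath(path: String): Unit = {
    require(path != null && path.trim.nonEmpty, "Jar path must not be an empty string")
    val uri = new URI(path)
    Option(uri.getScheme).getOrElse("file") match {
      case "file" | "local" =>
        val file = new File(uri.getPath)
        require(file.isFile, s"Jar '$path' does not exist or is not a file")
      case _ =>
        // Remote schemes (hdfs://, http://, ...) are left for the file server to resolve.
    }
  }
}
{code}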



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20744) Predicates with multiple columns do not work

2017-05-15 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-20744:

Description: 
The following code reproduces the problem:

{code}
scala> spark.range(10).selectExpr("id as a", "id as b").where("(a,b) in 
((1,1))").show
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', `a`, 
'b', `b`) IN (named_struct('col1', 1, 'col2', 1)))' due to data type mismatch: 
Arguments must be same type; line 1 pos 6;
'Filter named_struct(a, a#42L, b, b#43L) IN (named_struct(col1, 1, col2, 1))
+- Project [id#39L AS a#42L, id#39L AS b#43L]
   +- Range (0, 10, step=1, splits=Some(1))
{code}

Similarly, it won't work from SQL either, even though other SQL databases 
support this syntax:

{code}
scala> spark.range(10).selectExpr("id as a", "id as 
b").createOrReplaceTempView("tab1")

scala> sql("select * from tab1 where (a,b) in ((1,1), (2,2))").show
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', 
tab1.`a`, 'b', tab1.`b`) IN (named_struct('col1', 1, 'col2', 1), 
named_struct('col1', 2, 'col2', 2)))' due to data type mismatch: Arguments must 
be same type; line 1 pos 31;
'Project [*]
+- 'Filter named_struct(a, a#50L, b, b#51L) IN (named_struct(col1, 1, col2, 
1),named_struct(col1, 2, col2, 2))
   +- SubqueryAlias tab1
  +- Project [id#47L AS a#50L, id#47L AS b#51L]
 +- Range (0, 10, step=1, splits=Some(1))
{code}

Other examples:
{code}
scala> sql("select * from tab1 where (a,b) =(1,1)").show
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', 
tab1.`a`, 'b', tab1.`b`) = named_struct('col1', 1, 'col2', 1))' due to data 
type mismatch: differing types in '(named_struct('a', tab1.`a`, 'b', tab1.`b`) 
= named_struct('col1', 1, 'col2', 1))' (struct<a:bigint,b:bigint> and 
struct<col1:int,col2:int>).; line 1 pos 25;
'Project [*]
+- 'Filter (named_struct(a, a#50L, b, b#51L) = named_struct(col1, 1, col2, 1))
   +- SubqueryAlias tab1
  +- Project [id#47L AS a#50L, id#47L AS b#51L]
 +- Range (0, 10, step=1, splits=Some(1))
{code}

Expressions such as (1,1) are apparently read as structs and then the types do 
not match. Perhaps they should be arrays.
The following code works:
{code}
sql("select * from tab1 where array(a,b) in (array(1,1),array(2,2))").show
{code}

This also works, but requires the cast:
{code}
sql("select * from tab1 where (a,b) in (named_struct('a', cast(1 as bigint), 
'b', cast(1 as bigint)))").show
{code}


  was:
The following code reproduces the problem:

{code}
scala> spark.range(10).selectExpr("id as a", "id as b").where("(a,b) in 
((1,1))").show
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', `a`, 
'b', `b`) IN (named_struct('col1', 1, 'col2', 1)))' due to data type mismatch: 
Arguments must be same type; line 1 pos 6;
'Filter named_struct(a, a#42L, b, b#43L) IN (named_struct(col1, 1, col2, 1))
+- Project [id#39L AS a#42L, id#39L AS b#43L]
   +- Range (0, 10, step=1, splits=Some(1))
{code}

Similarly, it won't work from SQL either, even though other SQL databases 
support this syntax:

{code}
scala> spark.range(10).selectExpr("id as a", "id as 
b").createOrReplaceTempView("tab1")

scala> sql("select * from tab1 where (a,b) in ((1,1), (2,2))").show
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', 
tab1.`a`, 'b', tab1.`b`) IN (named_struct('col1', 1, 'col2', 1), 
named_struct('col1', 2, 'col2', 2)))' due to data type mismatch: Arguments must 
be same type; line 1 pos 31;
'Project [*]
+- 'Filter named_struct(a, a#50L, b, b#51L) IN (named_struct(col1, 1, col2, 
1),named_struct(col1, 2, col2, 2))
   +- SubqueryAlias tab1
  +- Project [id#47L AS a#50L, id#47L AS b#51L]
 +- Range (0, 10, step=1, splits=Some(1))
{code}

Other examples:
{code}
scala> sql("select * from tab1 where (a,b) =(1,1)").show
org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('a', 
tab1.`a`, 'b', tab1.`b`) = named_struct('col1', 1, 'col2', 1))' due to data 
type mismatch: differing types in '(named_struct('a', tab1.`a`, 'b', tab1.`b`) 
= named_struct('col1', 1, 'col2', 1))' (struct<a:bigint,b:bigint> and 
struct<col1:int,col2:int>).; line 1 pos 25;
'Project [*]
+- 'Filter (named_struct(a, a#50L, b, b#51L) = named_struct(col1, 1, col2, 1))
   +- SubqueryAlias tab1
  +- Project [id#47L AS a#50L, id#47L AS b#51L]
 +- Range (0, 10, step=1, splits=Some(1))
{code}

Expressions such as (1,1) are apparently read as structs and then the types do 
not match. Perhaps they should be arrays.
The following code works:
{code}
sql("select * from tab1 where array(a,b) in (array(1,1),array(2,2))").show
{code}



> Predicates with multiple columns do not work
> 
>
> Key: SPARK-20744
> URL: https://issues.apache.org/jira/browse/SPARK-20744
> Project: Spark
>  

[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing

2017-05-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010709#comment-16010709
 ] 

Sean Owen commented on SPARK-18359:
---

This behavior was not correct, or at least more problematic, because it varied 
even within one cluster according to the particulars of the environment. 
The machine environment shouldn't affect correctness this way.
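As a concrete illustration of that variance (plain JDK code, not Spark; the values in the comments are what a typical JVM produces):

{code:lang=scala}
import java.text.NumberFormat
import java.util.Locale

// The same text parses to different numbers depending on the (machine-dependent) locale.
object LocaleParsingSketch extends App {
  val text = "1,5"
  val french = NumberFormat.getInstance(Locale.FRANCE).parse(text).doubleValue() // 1.5  (comma is the decimal separator)
  val us     = NumberFormat.getInstance(Locale.US).parse(text).doubleValue()     // 15.0 (comma is a grouping separator)
  println(s"FRANCE -> $french, US -> $us")
}
{code}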

> Let user specify locale in CSV parsing
> --
>
> Key: SPARK-18359
> URL: https://issues.apache.org/jira/browse/SPARK-18359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: yannick Radji
>
> On the DataFrameReader object there is no CSV-specific option to set the decimal 
> separator to a comma rather than a dot, as is customary in France and elsewhere in Europe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing

2017-05-15 Thread Alexander Enns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010704#comment-16010704
 ] 

Alexander Enns commented on SPARK-18359:


IMHO the changes done for SPARK-18076 need to be reverted. 
The former behaviour of using the JVM locale was correct.



> Let user specify locale in CSV parsing
> --
>
> Key: SPARK-18359
> URL: https://issues.apache.org/jira/browse/SPARK-18359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: yannick Radji
>
> On the DataFrameReader object there is no CSV-specific option to set the decimal 
> separator to a comma rather than a dot, as is customary in France and elsewhere in Europe.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20669) LogisticRegression family should be case insensitive

2017-05-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-20669.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> LogisticRegression family should be case insensitive 
> -
>
> Key: SPARK-20669
> URL: https://issues.apache.org/jira/browse/SPARK-20669
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{LogisticRegression}} family should be case insensitive 
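For reference, the kind of normalization this asks for could look roughly like the sketch below. It is illustrative only; the supported values and the actual parameter validation in {{LogisticRegression}} may differ.

{code:lang=scala}
import java.util.Locale

// Sketch: accept the "family" parameter case-insensitively before dispatching on it.
object FamilyParamSketch {
  private val supportedFamilies = Set("auto", "binomial", "multinomial")

  def normalizeFamily(family: String): String = {
    val normalized = family.toLowerCase(Locale.ROOT)
    require(supportedFamilies.contains(normalized),
      s"Unsupported family: $family. Supported values: ${supportedFamilies.mkString(", ")}")
    normalized
  }
}
{code}

With something like this in place, {{setFamily("Binomial")}} and {{setFamily("binomial")}} would behave the same.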



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20669) LogisticRegression family should be case insensitive

2017-05-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-20669:
---

Assignee: zhengruifeng

> LogisticRegression family should be case insensitive 
> -
>
> Key: SPARK-20669
> URL: https://issues.apache.org/jira/browse/SPARK-20669
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{LogisticRegression}} family should be case insensitive 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20740) Expose UserDefinedType make sure could extends it

2017-05-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010477#comment-16010477
 ] 

Hyukjin Kwon edited comment on SPARK-20740 at 5/15/17 2:57 PM:
---

Probably we should describe the use case here. BTW, this does not look like a bug.


was (Author: hyukjin.kwon):
Probably we should describe the usage here. BTW, this does not look like a bug.

> Expose UserDefinedType make sure could extends it
> -
>
> Key: SPARK-20740
> URL: https://issues.apache.org/jira/browse/SPARK-20740
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>
> Users may want to extend UserDefinedType to create their own data types. We should 
> make UserDefinedType a public class.
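To make the use case concrete, here is a rough sketch of what extending {{UserDefinedType}} could look like if it were public. {{Point}} and {{PointUDT}} are hypothetical, and the exact abstract members (for example whether {{serialize}} works with {{InternalRow}}) vary between Spark versions.

{code:lang=scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types._

// Hypothetical user class we would like to store in a DataFrame column.
case class Point(x: Double, y: Double)

// Sketch of a user-defined type for Point, assuming UserDefinedType were public.
class PointUDT extends UserDefinedType[Point] {
  override def sqlType: DataType =
    StructType(Seq(StructField("x", DoubleType), StructField("y", DoubleType)))

  override def serialize(p: Point): Any =
    new GenericInternalRow(Array[Any](p.x, p.y))

  override def deserialize(datum: Any): Point = datum match {
    case row: InternalRow => Point(row.getDouble(0), row.getDouble(1))
  }

  override def userClass: Class[Point] = classOf[Point]
}
{code}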



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20745) Data gets wrongly copied from one row to others, possibly related to named structs

2017-05-15 Thread Martin Mauch (JIRA)
Martin Mauch created SPARK-20745:


 Summary: Data gets wrongly copied from one row to others, possibly 
related to named structs
 Key: SPARK-20745
 URL: https://issues.apache.org/jira/browse/SPARK-20745
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.1.1
Reporter: Martin Mauch


We encountered a strange bug where Spark copies data from one row over to other 
rows. It might be related to named structs; at least the minimal repro we were 
able to achieve involves them: 
https://github.com/crealytics/spark_bug/blob/master/src/test/scala/spark/DataFrameConversionsSpec.scala
The interesting part is that Spark behaves correctly when the DataFrame is 
cached (see the 2nd example) and also if you run the failing example a second 
time (see the 1st vs. 3rd example).
You should be able to check out the above project and reproduce the problem with 
{{sbt test}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns

2017-05-15 Thread Peter Halverson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010621#comment-16010621
 ] 

Peter Halverson edited comment on SPARK-18004 at 5/15/17 2:37 PM:
--

Here's the culprit: a private function that converts a Scala value to a SQL 
literal. Note the use of {{toString}} to format dates and timestamps. It seems like 
there should be some hook into a {{JDBCDialect}} that can handle vendor-specific 
syntax, etc.

{code:lang=scala}
  /**
   * Converts value to SQL expression.
   */
  private def compileValue(value: Any): Any = value match {
    case stringValue: String => s"'${escapeSql(stringValue)}'"
    case timestampValue: Timestamp => "'" + timestampValue + "'"
    case dateValue: Date => "'" + dateValue + "'"
    case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
    case _ => value
  }
{code}
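For illustration, an Oracle-aware variant could format temporal literals explicitly, along these lines. This is a sketch only, not the actual Spark code path; the {{TO_TIMESTAMP}}/{{TO_DATE}} format masks are assumptions about what Oracle expects.

{code:lang=scala}
import java.sql.{Date, Timestamp}

// Sketch of a dialect-aware literal compiler: temporal values get an explicit
// Oracle format mask instead of relying on toString, which triggers ORA-01861.
object OracleLiteralSketch {
  private def escapeSql(value: String): String = value.replace("'", "''")

  def compileValue(value: Any): Any = value match {
    case stringValue: String => s"'${escapeSql(stringValue)}'"
    case timestampValue: Timestamp =>
      s"TO_TIMESTAMP('$timestampValue', 'YYYY-MM-DD HH24:MI:SS.FF')"
    case dateValue: Date =>
      s"TO_DATE('$dateValue', 'YYYY-MM-DD')"
    case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
    case _ => value
  }
}
{code}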


was (Author: phalverson):
Here's the culprit, a private function that converts a scala value to a SQL 
literal. Note use of {{toString}} to format dates and timestamps. Seems like 
there should be some hook to a {{JDBCDialect here}} that can handle 
vendor-specific syntax etc.

{code:lang=scala}
  /**
   * Converts value to SQL expression.
   */
  private def compileValue(value: Any): Any = value match {
    case stringValue: String => s"'${escapeSql(stringValue)}'"
    case timestampValue: Timestamp => "'" + timestampValue + "'"
    case dateValue: Date => "'" + dateValue + "'"
    case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
    case _ => value
  }
{code}

> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
> 
>
> Key: SPARK-18004
> URL: https://issues.apache.org/jira/browse/SPARK-18004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Suhas Nalapure
>Priority: Critical
>
> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns 
> with Exception java.sql.SQLDataException: ORA-01861: literal does not match 
> format string:
> Java source code (this code works fine for mysql & mssql databases) :
> {noformat}
> //DataFrame df = create a DataFrame over an Oracle table
> df = df.filter(df.col("TS").lt(new 
> java.sql.Timestamp(System.currentTimeMillis())));
>   df.explain();
>   df.show();
> {noformat}
> Log statements with the Exception:
> {noformat}
> Schema: root
>  |-- ID: string (nullable = false)
>  |-- TS: timestamp (nullable = true)
>  |-- DEVICE_ID: string (nullable = true)
>  |-- REPLACEMENT: string (nullable = true)
> {noformat}
> {noformat}
> == Physical Plan ==
> Filter (TS#1 < 1476861841934000)
> +- Scan 
> JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user,
>  password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, 
> driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] 
> PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)]
> 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] 
> org.apache.spark.executor.Executor
> Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: ORA-01861: literal does not match format string
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043)
>   at 
> oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566)
>   at 
> oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:383)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at 

[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns

2017-05-15 Thread Peter Halverson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010621#comment-16010621
 ] 

Peter Halverson commented on SPARK-18004:
-

Here's the culprit: a private function that converts a Scala value to a SQL 
literal. Note the use of {{toString}} to format dates and timestamps. It seems like 
there should be some hook into a {{JDBCDialect}} that can handle 
vendor-specific syntax, etc.

{code:lang=scala}
  /**
   * Converts value to SQL expression.
   */
  private def compileValue(value: Any): Any = value match {
    case stringValue: String => s"'${escapeSql(stringValue)}'"
    case timestampValue: Timestamp => "'" + timestampValue + "'"
    case dateValue: Date => "'" + dateValue + "'"
    case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
    case _ => value
  }
{code}

> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
> 
>
> Key: SPARK-18004
> URL: https://issues.apache.org/jira/browse/SPARK-18004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Suhas Nalapure
>Priority: Critical
>
> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns 
> with Exception java.sql.SQLDataException: ORA-01861: literal does not match 
> format string:
> Java source code (this code works fine for mysql & mssql databases) :
> {noformat}
> //DataFrame df = create a DataFrame over an Oracle table
> df = df.filter(df.col("TS").lt(new 
> java.sql.Timestamp(System.currentTimeMillis())));
>   df.explain();
>   df.show();
> {noformat}
> Log statements with the Exception:
> {noformat}
> Schema: root
>  |-- ID: string (nullable = false)
>  |-- TS: timestamp (nullable = true)
>  |-- DEVICE_ID: string (nullable = true)
>  |-- REPLACEMENT: string (nullable = true)
> {noformat}
> {noformat}
> == Physical Plan ==
> Filter (TS#1 < 1476861841934000)
> +- Scan 
> JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user,
>  password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, 
> driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] 
> PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)]
> 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] 
> org.apache.spark.executor.Executor
> Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: ORA-01861: literal does not match format string
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043)
>   at 
> oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566)
>   at 
> oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:383)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> 

[jira] [Updated] (SPARK-20742) SparkAppHandle.getState() doesnt return the right state when the launch is done on a mesos master in cluster mode

2017-05-15 Thread Avinash Venkateshaiah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Venkateshaiah updated SPARK-20742:
--
Affects Version/s: 2.1.0

> SparkAppHandle.getState() doesnt return the right state when the launch is 
> done on a mesos master in cluster mode
> -
>
> Key: SPARK-20742
> URL: https://issues.apache.org/jira/browse/SPARK-20742
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2, 2.1.0, 2.1.1
>Reporter: Avinash Venkateshaiah
>Priority: Critical
>
> I am launching a Spark application on a Mesos master in cluster mode using 
> the SparkLauncher. This returns a handle (SparkAppHandle). However, when I 
> try to check the state using handle.getState(), it is always UNKNOWN even 
> though the job was submitted and completed successfully.
> I'm using Spark 2.1.0.
> Thanks,
> Avinash
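For context, a minimal launch-and-poll sketch looks roughly like this (the master URL, jar path, and class name are hypothetical); per this report, the loop below would spin forever in Mesos cluster mode because the state never leaves UNKNOWN:

{code:lang=scala}
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LauncherStateSketch {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setMaster("mesos://mesos-master:7077")   // hypothetical Mesos master URL
      .setDeployMode("cluster")
      .setAppResource("/path/to/app.jar")       // hypothetical application jar
      .setMainClass("com.example.Main")         // hypothetical main class
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state changed: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })

    // Expected usage: poll (or rely on the listener) until a final state is reached.
    while (!handle.getState.isFinal) {
      Thread.sleep(1000)
    }
    println(s"final state: ${handle.getState}")
  }
}
{code}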



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


