[jira] [Assigned] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17637:


Assignee: Apache Spark  (was: Zhan Zhang)

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.
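
To make the contrast concrete, here is a minimal Scala sketch of the two assignment 
policies. The Offer class and both functions are hypothetical illustrations, not 
Spark's actual scheduler code.

{code}
// Hypothetical model of executor resource offers; not Spark's real API.
case class Offer(executorId: String, var freeCores: Int)

// Round-robin (current behaviour): cycle through executors so the load is
// spread evenly across the cluster.
def roundRobin(offers: Seq[Offer], numTasks: Int): Seq[String] =
  (0 until numTasks).map(i => offers(i % offers.size).executorId)

// Packed: keep filling the busiest executor that still has a free core, so the
// remaining executors stay idle and can be released under dynamic allocation.
def packed(offers: Seq[Offer], numTasks: Int): Seq[String] =
  (0 until numTasks).map { _ =>
    val target = offers.filter(_.freeCores > 0).minBy(_.freeCores)
    target.freeCores -= 1
    target.executorId
  }

val offers = Seq(Offer("exec-1", 4), Offer("exec-2", 4), Offer("exec-3", 4))
roundRobin(offers, 4)            // exec-1, exec-2, exec-3, exec-1
packed(offers.map(_.copy()), 4)  // exec-1, exec-1, exec-1, exec-1
{code}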






[jira] [Assigned] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17637:


Assignee: Zhan Zhang  (was: Apache Spark)

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Assignee: Zhan Zhang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.






[jira] [Reopened] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-10-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-17637:
-

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Assignee: Zhan Zhang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.






[jira] [Commented] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-10-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579382#comment-15579382
 ] 

Reynold Xin commented on SPARK-17637:
-

Note: I reverted the commit due to quality issues. We can improve it and merge 
it again.


> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Assignee: Zhan Zhang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.






[jira] [Created] (SPARK-17958) Why I ran into issue " accumulator, copyandreset must be zero error"

2016-10-15 Thread dianwei Han (JIRA)
dianwei Han created SPARK-17958:
---

 Summary: Why I ran into issue " accumulator, copyandreset must be 
zero error"
 Key: SPARK-17958
 URL: https://issues.apache.org/jira/browse/SPARK-17958
 Project: Spark
  Issue Type: Question
  Components: Java API
Affects Versions: 2.0.1
 Environment: linux. 
Reporter: dianwei Han
Priority: Blocker


I used to run Spark code under Spark 1.5 with no problem at all. Right now, running 
code on Spark 2.0, I see this error: 

"Exception in thread "main" java.lang.AssertionError: assertion failed: 
copyAndReset must return a zero value copy
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Please help or give a suggestion on how to fix this issue.

Really appreciate it,

Dianwei
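
For context, the assertion comes from AccumulatorV2.writeReplace in Spark 2.x, which 
requires copyAndReset to return a copy whose isZero is true. Below is a minimal Scala 
sketch of a custom AccumulatorV2 that satisfies that contract; the LongSumAccumulator 
class is a hypothetical example, not the reporter's code.

{code}
import org.apache.spark.util.AccumulatorV2

// Hypothetical accumulator for Spark 2.x, shown only to illustrate the
// copyAndReset contract that the failing assertion enforces.
class LongSumAccumulator extends AccumulatorV2[Long, Long] {
  private var sum = 0L

  override def isZero: Boolean = sum == 0L

  override def copy(): LongSumAccumulator = {
    val acc = new LongSumAccumulator
    acc.sum = sum
    acc
  }

  // Must return a *zero-valued* copy; a copy that still carries state is what
  // triggers "copyAndReset must return a zero value copy".
  override def copyAndReset(): LongSumAccumulator = new LongSumAccumulator

  override def reset(): Unit = sum = 0L
  override def add(v: Long): Unit = sum += v
  override def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
    case o: LongSumAccumulator => sum += o.sum
    case _ => throw new UnsupportedOperationException("incompatible accumulator")
  }
  override def value: Long = sum
}
{code}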






[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Linbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579294#comment-15579294
 ] 

Linbo commented on SPARK-17957:
---

Thank you!

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>Priority: Critical
>  Labels: correctness
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in 
> the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   






[jira] [Updated] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17957:

Priority: Critical  (was: Major)

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>Priority: Critical
>  Labels: correctness
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in 
> the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   






[jira] [Updated] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17957:

Target Version/s: 2.0.2, 2.1.0  (was: 2.0.2)

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>  Labels: correctness
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in 
> the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   






[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-15 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579284#comment-15579284
 ] 

Shixiong Zhu commented on SPARK-13747:
--

[~chinwei] could you post the stack trace here?

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool will run another task 
> in the same thread, and that task then sees local properties polluted by the suspended job.
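
A sketch of one possible mitigation, assuming the spark-shell context of the snippet 
above (sc available and the toDF implicits imported): run the concurrent queries on a 
dedicated fixed-size thread pool instead of the ForkJoinPool-backed global context, so 
a thread that is blocked inside SparkContext.runJob is never reused for another query.

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Plain worker threads simply block in Await.ready instead of picking up
// other tasks the way a ForkJoinPool does.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

val counts = (1 to 100).map { _ =>
  Future {
    sc.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count()
  }
}
counts.foreach(f => println(Await.result(f, Duration.Inf)))
{code}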






[jira] [Updated] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17957:

Labels: correctness  (was: joins na.fill)

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>  Labels: correctness
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in 
> the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   






[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579282#comment-15579282
 ] 

Xiao Li commented on SPARK-17957:
-

Found the bug. 
{noformat}
Project [a#29, b#30, c#31, d#48]
+- Join Inner, (a#29 = a#47)
   :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) 
AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, 
cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
   :  +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 
0.0) as int))
   : +- Join FullOuter, (a#5 = a#15)
   ::- LocalRelation [a#5, b#6]
   :+- LocalRelation [a#15, c#16]
   +- LocalRelation [a#47, d#48]
{noformat}

Will fix it soon. 



> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>  Labels: joins, na.fill
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in 
> the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   






[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor

2016-10-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579262#comment-15579262
 ] 

Tejas Patil commented on SPARK-17954:
-

I agree with [~srowen]'s comment about this being more of a networking or 
IP/hostname problem. Of course, this is based on the limited information that 
we have at hand.

Have you investigated this issue?
- Were you able to ping / telnet from one node to another while this exception 
happened? If you are running in Docker or another container setup, you will have 
to do this check at the container level.
- You mentioned that things worked fine with 1.6. Was it running side by side 
in the same cluster, in the same time window in which 2.0 had fetch failures?

> FetchFailedException executor cannot connect to another worker executor
> ---
>
> Key: SPARK-17954
> URL: https://issues.apache.org/jira/browse/SPARK-17954
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Vitaly Gerasimov
>
> I have a standalone-mode Spark cluster which has three nodes:
> master.test
> worker1.test
> worker2.test
> I am trying to run the following code in the spark shell:
> {code}
> val json = spark.read.json("hdfs://master.test/json/a.js.gz", 
> "hdfs://master.test/json/b.js.gz")
> json.createOrReplaceTempView("messages")
> spark.sql("select count(*) from messages").show()
> {code}
> and I am getting the following exception:
> {code}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 

[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579260#comment-15579260
 ] 

Xiao Li commented on SPARK-17957:
-

You can see the plan. The optimized plan still contains the full outer join. :) Thus, 
it should not be caused by outer join elimination. 

{noformat}
val a = Seq((1, 2), (2, 3)).toDF("a", "b")
val b = Seq((2, 5), (3, 4)).toDF("a", "c")
val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
val c = Seq((3, 1)).toDF("a", "d")
ab.join(c, "a").explain(true)
{noformat}

{noformat}
== Optimized Logical Plan ==
Project [a#29, b#30, c#31, d#42]
+- Join Inner, (a#29 = a#41)
   :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) 
AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, 
cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
   :  +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 
0.0) as int))
   : +- Join FullOuter, (a#5 = a#15)
   ::- LocalRelation [a#5, b#6]
   :+- LocalRelation [a#15, c#16]
   +- LocalRelation [a#41, d#42]
{noformat}

Let me find out what the cause is. 

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>  Labels: joins, na.fill
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in 
> the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   






[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579252#comment-15579252
 ] 

Xiao Li commented on SPARK-17957:
-

Thank you for reporting it. Let me do a quick check.

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>  Labels: joins, na.fill
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in 
> the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   






[jira] [Assigned] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17812:


Assignee: Apache Spark  (was: Cody Koeninger)

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user-specified offsets for manually specified topicpartitions
> Currently agreed-on plan:
> Mutually exclusive subscription options (only assign is new to this ticket)
> {noformat}
> .option("subscribe","topicFoo,topicBar")
> .option("subscribePattern","topic.*")
> .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
> {noformat}
> where assign can only be specified that way, no inline offsets
> Single starting position option with three mutually exclusive types of value
> {noformat}
> .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 
> 1234, "1": -2}, "topicBar":{"0": -1}}""")
> {noformat}
> startingOffsets with json fails if any topicpartition in the assignments 
> doesn't have an offset.
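
For illustration, a rough sketch of how the agreed-on options could be combined on the 
structured streaming Kafka source; the bootstrap-servers value and the topic/partition 
layout are assumptions, and exactly one of subscribe / subscribePattern / assign would 
be set.

{code}
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")  // assumed broker address
  .option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
  // "earliest" | "latest" | per-partition JSON offsets; when JSON is used,
  // every assigned topicpartition must be given an offset (see note above).
  .option("startingOffsets",
    """{"topicfoo": {"0": 1234, "1": -2}, "topicbar": {"0": -1, "1": -1}}""")
  .load()
{code}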






[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579206#comment-15579206
 ] 

Apache Spark commented on SPARK-17812:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/15504

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user-specified offsets for manually specified topicpartitions
> Currently agreed-on plan:
> Mutually exclusive subscription options (only assign is new to this ticket)
> {noformat}
> .option("subscribe","topicFoo,topicBar")
> .option("subscribePattern","topic.*")
> .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
> {noformat}
> where assign can only be specified that way, no inline offsets
> Single starting position option with three mutually exclusive types of value
> {noformat}
> .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 
> 1234, "1": -2}, "topicBar":{"0": -1}}""")
> {noformat}
> startingOffsets with json fails if any topicpartition in the assignments 
> doesn't have an offset.






[jira] [Assigned] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17812:


Assignee: Cody Koeninger  (was: Apache Spark)

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user-specified offsets for manually specified topicpartitions
> Currently agreed-on plan:
> Mutually exclusive subscription options (only assign is new to this ticket)
> {noformat}
> .option("subscribe","topicFoo,topicBar")
> .option("subscribePattern","topic.*")
> .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
> {noformat}
> where assign can only be specified that way, no inline offsets
> Single starting position option with three mutually exclusive types of value
> {noformat}
> .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 
> 1234, "1": -2}, "topicBar":{"0": -1}}""")
> {noformat}
> startingOffsets with json fails if any topicpartition in the assignments 
> doesn't have an offset.






[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception for nested JavaBeans

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Description: 
As per the latest Spark documentation for Java at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 

{quote}
Nested JavaBeans and List or Array fields are supported though.
{quote}

However, a nested JavaBean is not working. Please see the code below.

SubCategory class

{code}
public class SubCategory implements Serializable{
private String id;
private String name;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}   
}

{code}

Category class

{code}
public class Category implements Serializable{
private String id;
private SubCategory subCategory;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public SubCategory getSubCategory() {
return subCategory;
}
public void setSubCategory(SubCategory subCategory) {
this.subCategory = subCategory;
}
}
{code}

SparkSample class

{code}
public class SparkSample {
public static void main(String[] args) throws IOException { 

SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local")
.getOrCreate();
//SubCategory
SubCategory sub = new SubCategory();
sub.setId("sc-111");
sub.setName("Sub-1");
//Category
Category category = new Category();
category.setId("s-111");
category.setSubCategory(sub);
//categoryList
List<Category> categoryList = new ArrayList<>();
categoryList.add(category);
 //DF
Dataset<Row> dframe = spark.createDataFrame(categoryList, 
Category.class);  
dframe.show();  
}
}
{code}


Above code throws below error.

{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of 
class com.sample.SubCategory)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.sample.SparkSample.main(SparkSample.java:33)
{code}


The createDataFrame method throws the above exception. But I observed that the 
createDataset method works fine with the code below.

{code}
Encoder<Category> encoder = Encoders.bean(Category.class); 
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}

  was:
As per latest spark documentation for Java at 

[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception for nested JavaBeans

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Description: 
As per the latest Spark documentation for Java at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 

{quote}
"Nested JavaBeans and List or Array fields are supported though". 
{quote}

However, a nested JavaBean is not working. Please see the code below.

SubCategory class

{code}
public class SubCategory implements Serializable{
private String id;
private String name;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}   
}

{code}

Category class

{code}
public class Category implements Serializable{
private String id;
private SubCategory subCategory;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public SubCategory getSubCategory() {
return subCategory;
}
public void setSubCategory(SubCategory subCategory) {
this.subCategory = subCategory;
}
}
{code}

SparkSample class

{code}
public class SparkSample {
public static void main(String[] args) throws IOException { 

SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local")
.getOrCreate();
//SubCategory
SubCategory sub = new SubCategory();
sub.setId("sc-111");
sub.setName("Sub-1");
//Category
Category category = new Category();
category.setId("s-111");
category.setSubCategory(sub);
//categoryList
List<Category> categoryList = new ArrayList<>();
categoryList.add(category);
 //DF
Dataset<Row> dframe = spark.createDataFrame(categoryList, 
Category.class);  
dframe.show();  
}
}
{code}


Above code throws below error.

{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of 
class com.sample.SubCategory)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.sample.SparkSample.main(SparkSample.java:33)
{code}


The createDataFrame method throws the above exception. But I observed that the 
createDataset method works fine with the code below.

{code}
Encoder<Category> encoder = Encoders.bean(Category.class); 
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}

  was:
As per latest spark documentation for Java at 

[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception for nested JavaBeans

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Description: 
As per the latest Spark documentation for Java at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 "Nested JavaBeans and List or Array fields are supported though". However, 
a nested JavaBean is not working. Please see the code below.

SubCategory class

{code}
public class SubCategory implements Serializable{
private String id;
private String name;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}   
}

{code}

Category class

{code}
public class Category implements Serializable{
private String id;
private SubCategory subCategory;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public SubCategory getSubCategory() {
return subCategory;
}
public void setSubCategory(SubCategory subCategory) {
this.subCategory = subCategory;
}
}
{code}

SparkSample class

{code}
public class SparkSample {
public static void main(String[] args) throws IOException { 

SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local")
.getOrCreate();
//SubCategory
SubCategory sub = new SubCategory();
sub.setId("sc-111");
sub.setName("Sub-1");
//Category
Category category = new Category();
category.setId("s-111");
category.setSubCategory(sub);
//categoryList
List<Category> categoryList = new ArrayList<>();
categoryList.add(category);
 //DF
Dataset<Row> dframe = spark.createDataFrame(categoryList, 
Category.class);  
dframe.show();  
}
}
{code}


Above code throws below error.

{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of 
class com.sample.SubCategory)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.sample.SparkSample.main(SparkSample.java:33)
{code}


The createDataFrame method throws the above exception. But I observed that the 
createDataset method works fine with the code below.

{code}
Encoder<Category> encoder = Encoders.bean(Category.class); 
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}

  was:
As per latest spark documentation for Java at 

[jira] [Comment Edited] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Linbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579142#comment-15579142
 ] 

Linbo edited comment on SPARK-17957 at 10/16/16 2:26 AM:
-

cc [~smilegator] and [~dongjoon]


was (Author: linbojin):
cc [~smilegator]

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>  Labels: joins, na.fill
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in 
> the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   






[jira] [Commented] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Linbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579142#comment-15579142
 ] 

Linbo commented on SPARK-17957:
---

cc [~smilegator]

> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>  Labels: joins, na.fill
>
> I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
> https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> val c = Seq((3, 1)).toDF("a", "d")
> c: org.apache.spark.sql.DataFrame = [a: int, d: int]
> scala> c.show
> +---+---+
> |  a|  d|
> +---+---+
> |  3|  1|
> +---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> +---+---+---+---+
> {code}
> And again, if I use persist, the result is correct. I think the problem is in 
> the join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661
> {code:title=spark-shell|borderStyle=solid}
> scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
> ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
> ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> ab.join(c, "a").show
> +---+---+---+---+
> |  a|  b|  c|  d|
> +---+---+---+---+
> |  3|  0|  4|  1|
> +---+---+---+---+
> {code}
>   






[jira] [Updated] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Linbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Linbo updated SPARK-17957:
--
Description: 
I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when I 
insert a na.fill(0) call between the outer join and the inner join in the same 
workflow as in SPARK-17060, I get a wrong result.

{code:title=spark-shell|borderStyle=solid}
scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
a: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
b: org.apache.spark.sql.DataFrame = [a: int, c: int]

scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> ab.show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  0|
|  3|  0|  4|
|  2|  3|  5|
+---+---+---+

scala> val c = Seq((3, 1)).toDF("a", "d")
c: org.apache.spark.sql.DataFrame = [a: int, d: int]

scala> c.show
+---+---+
|  a|  d|
+---+---+
|  3|  1|
+---+---+

scala> ab.join(c, "a").show
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
+---+---+---+---+
{code}

And again, if I use persist, the result is correct. I think the problem is in the 
join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661

{code:title=spark-shell|borderStyle=solid}
scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
... 1 more field]

scala> ab.show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  0|
|  3|  0|  4|
|  2|  3|  5|
+---+---+---+

scala> ab.join(c, "a").show
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  3|  0|  4|  1|
+---+---+---+---+
{code}
  

  was:
I reported a similar bug two months ago and it was fixed in Spark 2.0.1: 
https://issues.apache.org/jira/browse/SPARK-17060 But I found a new bug: when I 
insert a na.fill(0) call between the outer join and the inner join in the same 
workflow as in SPARK-17060, I get a wrong result.

{code:title=spark-shell|borderStyle=solid}
scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
a: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
b: org.apache.spark.sql.DataFrame = [a: int, c: int]

scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> ab.show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  0|
|  3|  0|  4|
|  2|  3|  5|
+---+---+---+

scala> val c = Seq((3, 1)).toDF("a", "d")
c: org.apache.spark.sql.DataFrame = [a: int, d: int]

scala> c.show
+---+---+
|  a|  d|
+---+---+
|  3|  1|
+---+---+

scala> ab.join(c, "a").show
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
+---+---+---+---+

scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0)
ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> ab.join(c, "a").show
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
+---+---+---+---+
{code}

And again, if I use persist, the result is correct. I think the problem is in the 
join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661

{code:title=spark-shell|borderStyle=solid}
scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
... 1 more field]

scala> ab.show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  0|
|  3|  0|  4|
|  2|  3|  5|
+---+---+---+


scala> ab.join(c, "a").show
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  3|  0|  4|  1|
+---+---+---+---+
{code}
  


> Calling outer join and na.fill(0) and then inner join will miss rows
> 
>
> Key: SPARK-17957
> URL: https://issues.apache.org/jira/browse/SPARK-17957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1, Mac, Local
>Reporter: Linbo
>  Labels: joins, na.fill
>
> I reported a similar bug two months ago, and it was fixed in Spark 2.0.1 
> (https://issues.apache.org/jira/browse/SPARK-17060). But I have found a new bug: when 
> I insert a na.fill(0) call between the outer join and the inner join in the same 
> workflow as in SPARK-17060, I get a wrong result.
> {code:title=spark-shell|borderStyle=solid}
> scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
> a: org.apache.spark.sql.DataFrame = [a: int, b: int]
> scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
> b: org.apache.spark.sql.DataFrame = [a: int, c: int]
> scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
> ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> ab.show
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1|  2|  0|
> |  3|  0|  4|
> |  2|  3|  5|
> +---+---+---+
> scala> 

[jira] [Created] (SPARK-17957) Calling outer join and na.fill(0) and then inner join will miss rows

2016-10-15 Thread Linbo (JIRA)
Linbo created SPARK-17957:
-

 Summary: Calling outer join and na.fill(0) and then inner join 
will miss rows
 Key: SPARK-17957
 URL: https://issues.apache.org/jira/browse/SPARK-17957
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
 Environment: Spark 2.0.1, Mac, Local

Reporter: Linbo


I reported a similar bug two months ago, and it was fixed in Spark 2.0.1 
(https://issues.apache.org/jira/browse/SPARK-17060). But I have found a new bug: when I 
insert a na.fill(0) call between the outer join and the inner join in the same 
workflow as in SPARK-17060, I get a wrong result.

{code:title=spark-shell|borderStyle=solid}
scala> val a = Seq((1, 2), (2, 3)).toDF("a", "b")
a: org.apache.spark.sql.DataFrame = [a: int, b: int]

scala> val b = Seq((2, 5), (3, 4)).toDF("a", "c")
b: org.apache.spark.sql.DataFrame = [a: int, c: int]

scala> val ab = a.join(b, Seq("a"), "fullouter").na.fill(0)
ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> ab.show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  0|
|  3|  0|  4|
|  2|  3|  5|
+---+---+---+

scala> val c = Seq((3, 1)).toDF("a", "d")
c: org.apache.spark.sql.DataFrame = [a: int, d: int]

scala> c.show
+---+---+
|  a|  d|
+---+---+
|  3|  1|
+---+---+

scala> ab.join(c, "a").show
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
+---+---+---+---+

scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0)
ab: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]

scala> ab.join(c, "a").show
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
+---+---+---+---+
{code}

And again, if I use persist, the result is correct. I think the problem is in the 
join optimizer, similar to this PR: https://github.com/apache/spark/pull/14661

{code:title=spark-shell|borderStyle=solid}
scala> val ab = a.join(b, Seq("a"), "outer").na.fill(0).persist
ab: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: int, b: int 
... 1 more field]

scala> ab.show
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  0|
|  3|  0|  4|
|  2|  3|  5|
+---+---+---+


scala> ab.join(c, "a").show
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  3|  0|  4|  1|
+---+---+---+---+
{code}
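
To see what the optimizer does differently in the two cases, the plans can be 
compared with the existing Dataset.explain method (a sketch reusing the a, b and c 
defined above; output omitted here):

{code:title=spark-shell|borderStyle=solid}
// plan of the variant that loses rows
a.join(b, Seq("a"), "fullouter").na.fill(0).join(c, "a").explain(true)

// plan of the persisted variant that returns the expected row
a.join(b, Seq("a"), "outer").na.fill(0).persist.join(c, "a").explain(true)
{code}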
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-10-15 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-17637:

Affects Version/s: 2.1.0

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Assignee: Zhan Zhang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-10-15 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-17637.
-
  Resolution: Fixed
Assignee: Zhan Zhang
Target Version/s: 2.1.0

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Assignee: Zhan Zhang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17637) Packed scheduling for Spark tasks across executors

2016-10-15 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-17637:

Fix Version/s: 2.1.0

> Packed scheduling for Spark tasks across executors
> --
>
> Key: SPARK-17637
> URL: https://issues.apache.org/jira/browse/SPARK-17637
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Zhan Zhang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the Spark scheduler implements round-robin scheduling of tasks to 
> executors, which is great as it distributes the load evenly across the 
> cluster, but it leads to significant resource waste in some cases, 
> especially when dynamic allocation is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6

2016-10-15 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578987#comment-15578987
 ] 

ding edited comment on SPARK-17951 at 10/16/16 12:22 AM:
-

I tried to call rdd.collect, which internally calls bm.getRemoteBytes in 
multiple threads. The data shows the difference in time spent on block fetch is 
also around 0.1s between Spark 1.5.1 and Spark 1.6.2, and it can be 
consistently reproduced.

In our scenario, we call getRemoteBytes twice. If we use Spark 1.6.2 it brings an 
additional 0.2s of overhead and causes a throughput decrease of around 5. We 
would like to reduce the overhead if possible.


was (Author: ding):
I tried to call rdd.collect, which internally calls bm.getRemoteBytes in 
multiple threads. The data shows the difference in time spent on block fetch is 
also around 0.1s between Spark 1.5.1 and Spark 1.6.2, and it can be 
consistently reproduced.

In our scenario, we call getRemoteBytes twice. If we use Spark 1.6.2 it brings an 
additional 0.2s of overhead and causes a throughput decrease of around 5. We 
would like to reduce the overhead as much as possible.

> BlockFetch with multiple threads slows down after spark 1.6
> ---
>
> Key: SPARK-17951
> URL: https://issues.apache.org/jira/browse/SPARK-17951
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.2
> Environment: cluster with 8 nodes, each node has 28 cores. 10Gb network
>Reporter: ding
>
> The following code demonstrates the issue:
> {code}
> def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName(s"BMTest")
> val size = 3344570
> val sc = new SparkContext(conf)
> val data = sc.parallelize(1 to 100, 8)
> var accum = sc.accumulator(0.0, "get remote bytes")
> var i = 0
> while(i < 91) {
>   accum = sc.accumulator(0.0, "get remote bytes")
>   val test = data.mapPartitionsWithIndex { (pid, iter) =>
> val N = size
> val bm = SparkEnv.get.blockManager
> val blockId = TaskResultBlockId(10*i + pid)
> val test = new Array[Byte](N)
> Random.nextBytes(test)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> buffer.put(test)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
> Iterator(1)
>   }.count()
>   
>   data.mapPartitionsWithIndex { (pid, iter) =>
> val before = System.nanoTime()
> 
> val bm = SparkEnv.get.blockManager
> (0 to 7).map(s => {
>   Future {
> val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s))
>   }
> }).map(Await.result(_, Duration.Inf))
> 
> accum.add((System.nanoTime() - before) / 1e9)
> Iterator(1)
>   }.count()
>   println("get remote bytes take: " + accum.value/8)
>   i += 1
> }
>   }
> {code}
> In Spark 1.6.2, the average "getting remote bytes" time is 0.19 s, while
> in Spark 1.5.1 the average "getting remote bytes" time is 0.09 s.
> However, if blocks are fetched in a single thread, the gap is much smaller:
> spark1.6.2  get remote bytes: 0.21 s
> spark1.5.1  get remote bytes: 0.20 s



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6

2016-10-15 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578987#comment-15578987
 ] 

ding commented on SPARK-17951:
--

I tried to call rdd.collect, which internally calls bm.getRemoteBytes in 
multiple threads. The data shows the difference in time spent on block fetch is 
also around 0.1s between Spark 1.5.1 and Spark 1.6.2, and it can be 
consistently reproduced.

In our scenario, we call getRemoteBytes twice. If we use Spark 1.6.2 it brings an 
additional 0.2s of overhead and causes a throughput decrease of around 5. We 
would like to reduce the overhead as much as possible.
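
For reference, the single-threaded comparison mentioned in the description can be 
approximated by replacing the Future-based fetch loop in the repro code with a 
sequential one (a sketch derived from that code, reusing its bm, i and 
TaskResultBlockId; not code from the original report):

{code}
// Fetch the same 8 blocks one after another instead of via Futures.
(0 to 7).foreach { s =>
  bm.getRemoteBytes(TaskResultBlockId(10 * i + s))
}
{code}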

> BlockFetch with multiple threads slows down after spark 1.6
> ---
>
> Key: SPARK-17951
> URL: https://issues.apache.org/jira/browse/SPARK-17951
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.2
> Environment: cluster with 8 nodes, each node has 28 cores. 10Gb network
>Reporter: ding
>
> The following code demonstrates the issue:
> {code}
> def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName(s"BMTest")
> val size = 3344570
> val sc = new SparkContext(conf)
> val data = sc.parallelize(1 to 100, 8)
> var accum = sc.accumulator(0.0, "get remote bytes")
> var i = 0
> while(i < 91) {
>   accum = sc.accumulator(0.0, "get remote bytes")
>   val test = data.mapPartitionsWithIndex { (pid, iter) =>
> val N = size
> val bm = SparkEnv.get.blockManager
> val blockId = TaskResultBlockId(10*i + pid)
> val test = new Array[Byte](N)
> Random.nextBytes(test)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> buffer.put(test)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
> Iterator(1)
>   }.count()
>   
>   data.mapPartitionsWithIndex { (pid, iter) =>
> val before = System.nanoTime()
> 
> val bm = SparkEnv.get.blockManager
> (0 to 7).map(s => {
>   Future {
> val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s))
>   }
> }).map(Await.result(_, Duration.Inf))
> 
> accum.add((System.nanoTime() - before) / 1e9)
> Iterator(1)
>   }.count()
>   println("get remote bytes take: " + accum.value/8)
>   i += 1
> }
>   }
> {code}
> In Spark 1.6.2, the average "getting remote bytes" time is 0.19 s, while
> in Spark 1.5.1 the average "getting remote bytes" time is 0.09 s.
> However, if blocks are fetched in a single thread, the gap is much smaller:
> spark1.6.2  get remote bytes: 0.21 s
> spark1.5.1  get remote bytes: 0.20 s



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread Lars Francke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578914#comment-15578914
 ] 

Lars Francke edited comment on SPARK-650 at 10/15/16 11:22 PM:
---

I can only come up with three reasons at the moment. I hope they all make sense.

1) Singletons/Static Initialisers run once per Class Loader where this class is 
being loaded/used. I haven't actually seen this being a problem (and it might 
actually be desired behaviour in this case) but making the init step explicit 
would prevent this from ever becoming one.
2) I'd like to fail fast for some things, and not upon first access (which might 
be behind a conditional somewhere).
3) It is hard enough to reason about where some piece of code is running as it 
is (Driver or Task/Executor); adding Singletons to the mix makes it even more 
confusing.


Thank you for reopening!


was (Author: lars_francke):
I can only come up with three reasons at the moment. I hope they all make sense.

1) Singletons/Static Initialisers run once per Class Loader where this class is 
being loaded/used. I haven't actually seen this being a problem (and it might 
actually be desired behaviour in this case) but making the init step explicit 
would prevent this from ever becoming one.
2) I'd like to fail fast for some things, and not upon first access (which might 
be behind a conditional somewhere).
3) It is hard enough to reason about where some piece of code is running as it 
is (Driver or Task/Executor); adding Singletons to the mix makes it even more 
confusing.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread Lars Francke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578914#comment-15578914
 ] 

Lars Francke commented on SPARK-650:


I can only come up with three reasons at the moment. I hope they all make sense.

1) Singletons/Static Initialisers run once per Class Loader where this class is 
being loaded/used. I haven't actually seen this being a problem (and it might 
actually be desired behaviour in this case) but making the init step explicit 
would prevent this from ever becoming one.
2) I'd like to fail fast for some things, and not upon first access (which might 
be behind a conditional somewhere).
3) It is hard enough to reason about where some piece of code is running as it 
is (Driver or Task/Executor); adding Singletons to the mix makes it even more 
confusing.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-15 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578894#comment-15578894
 ] 

Cody Koeninger commented on SPARK-17812:


As you just said yourself, assign doesn't mean you necessarily know the exact 
starting offsets you want.

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latest offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions
> currently agreed on plan:
> Mutually exclusive subscription options (only assign is new to this ticket)
> {noformat}
> .option("subscribe","topicFoo,topicBar")
> .option("subscribePattern","topic.*")
> .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
> {noformat}
> where assign can only be specified that way, no inline offsets
> Single starting position option with three mutually exclusive types of value
> {noformat}
> .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 
> 1234, "1": -2}, "topicBar":{"0": -1}}""")
> {noformat}
> startingOffsets with json fails if any topicpartition in the assignments 
> doesn't have an offset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-650:
-

As you wish, but I disagree with this type of reasoning about JIRAs. I don't 
think anyone has addressed why a singleton isn't the answer. I can think of 
corner cases, but that's why I suspect it isn't something that has needed 
implementing. 

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread Lars Francke (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578875#comment-15578875
 ] 

Lars Francke commented on SPARK-650:


I also have to disagree with this being a duplicate or obsolete.

[~oarmand] and [~Skamandros] already mentioned reasons regarding the 
duplication.

About it being obsolete: I have seen multiple clients facing this problem, 
finding this issue and hoping it'd get fixed some day. I would hazard a guess 
and say that most _users_ of Spark have no JIRA account here and do not 
register or log in just to vote for this issue. That said: This issue is (with 
six votes) in the top 150 out of almost 17k total issues in the Spark project.

As it happens this is a non-trivial thing to implement in Spark (as far as I 
can tell from my limited knowledge of the inner workings) so it's pretty hard 
for a "drive by" contributor to help here.

You had the discussion about community perception on the mailing list (re: 
Spark Improvement Proposals) and this issue happens to be one of those that at 
least I see popping up every once in a while in discussions with clients.

I would love to see this issue staying open as a feature request and have some 
hope that someone someday will implement it.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17865) R API for global temp view

2016-10-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-17865.
---
Resolution: Not A Problem

> R API for global temp view
> --
>
> Key: SPARK-17865
> URL: https://issues.apache.org/jira/browse/SPARK-17865
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> We need to add the R API for managing global temp views, mirroring the 
> changes in SPARK-17338.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17865) R API for global temp view

2016-10-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578835#comment-15578835
 ] 

Reynold Xin commented on SPARK-17865:
-

Alright, in this case I don't think we need an R API for now.


> R API for global temp view
> --
>
> Key: SPARK-17865
> URL: https://issues.apache.org/jira/browse/SPARK-17865
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> We need to add the R API for managing global temp views, mirroring the 
> changes in SPARK-17338.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-15 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578723#comment-15578723
 ] 

Saikat Kanjilal commented on SPARK-9487:


Synced the code; I am first familiarizing myself with how to run the unit tests 
and work in the code, [~srowen] [~holdenk]. Next steps will be to run the unit 
tests and report the results here. Stay tuned.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
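
A small illustration of why the thread count leaks into results (a standalone 
sketch, not taken from the test suites): the default number of partitions, and 
hence anything seeded by partition IDs, follows the local[N] setting.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("parallelism-demo"))
// Prints 4 under local[4] but 2 under local[2]; partition-ID-based randomness differs accordingly.
println(sc.parallelize(1 to 100).partitions.length)
sc.stop()
{code}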



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-15 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578699#comment-15578699
 ] 

Ofir Manor commented on SPARK-17812:


I think Michael suggests that if you use {{startingOffsets}} without using 
{{assign}}, that could work (use the topic-partition list from 
{{startingOffsets}}), and it would simplify the user's code (no need to specify 
two similar lists, simpler resume, etc.).
You could keep an explicit {{assign}} for the "more rare?" cases, where someone 
wants to specify a list of topic-partitions but also earliest / latest / 
timestamp.
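
For concreteness, the combination being discussed would look roughly like this on 
the reader side (the {{assign}} and {{startingOffsets}} option names are the ones 
proposed in this ticket and quoted below, not a released API; broker and topic 
names are made up):

{code}
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("assign", """{"topicfoo": [0, 1], "topicbar": [0, 1]}""")
  .option("startingOffsets", """{"topicfoo": {"0": 1234, "1": -2}, "topicbar": {"0": -1}}""")
  .load()
{code}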

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cody Koeninger
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions
> currently agreed on plan:
> Mutually exclusive subscription options (only assign is new to this ticket)
> {noformat}
> .option("subscribe","topicFoo,topicBar")
> .option("subscribePattern","topic.*")
> .option("assign","""{"topicfoo": [0, 1],"topicbar": [0, 1]}""")
> {noformat}
> where assign can only be specified that way, no inline offsets
> Single starting position option with three mutually exclusive types of value
> {noformat}
> .option("startingOffsets", "earliest" | "latest" | """{"topicFoo": {"0": 
> 1234, "1": -2}, "topicBar":{"0": -1}}""")
> {noformat}
> startingOffsets with json fails if any topicpartition in the assignments 
> doesn't have an offset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17865) R API for global temp view

2016-10-15 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578639#comment-15578639
 ] 

Felix Cheung commented on SPARK-17865:
--

I see. So then we could either omit the support for global session/SharedState 
in R, since there is only one, or, alternatively, build support for it.

For the latter, I think a simple approach would be to add a flag for 
sparkR.session.stop(keep.global.session = TRUE), in which case the SparkContext 
and global session will not be stopped and will be reused when a new session is 
initialized. With that we could then add methods for global temp views and so on.

Do we believe there will be more features around the global session in the future? 
If so, there might be a good reason to support this in R.


> R API for global temp view
> --
>
> Key: SPARK-17865
> URL: https://issues.apache.org/jira/browse/SPARK-17865
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> We need to add the R API for managing global temp views, mirroring the 
> changes in SPARK-17338.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-15 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578607#comment-15578607
 ] 

Hossein Falaki commented on SPARK-17878:


I think moving it to another ticket is a good idea. One concern is that the API 
would be different between Scala and DDL.
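
For comparison, the existing single-value behaviour looks like this today; 
{{nullValue}} is an option the CSV source already supports, while the numbered 
{{nullValue1}}/{{nullValue2}} options quoted below are only the proposal (the file 
path is a placeholder):

{code}
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")   // only a single token can be treated as null today
  .load("/tmp/example.csv")
{code}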

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can be easily implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or more 
> null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578542#comment-15578542
 ] 

Sean Owen commented on SPARK-17954:
---

Lots of things changed. I think you need to first confirm connectivity to that 
hostname from the host in question to rule out more basic env problems. It may 
be a domain name vs IP problem.
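
One low-tech check that can be run from the affected hosts without leaving the 
Scala shell (host and port taken from the stack trace below; this only verifies 
basic reachability, nothing Spark-specific):

{code}
// Throws java.net.ConnectException if the executor's address/port is unreachable.
new java.net.Socket("worker1.test", 51029).close()
{code}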

> FetchFailedException executor cannot connect to another worker executor
> ---
>
> Key: SPARK-17954
> URL: https://issues.apache.org/jira/browse/SPARK-17954
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Vitaly Gerasimov
>
> I have a standalone mode Spark cluster which has three nodes:
> master.test
> worker1.test
> worker2.test
> I am trying to run the following code in spark-shell:
> {code}
> val json = spark.read.json("hdfs://master.test/json/a.js.gz", 
> "hdfs://master.test/json/b.js.gz")
> json.createOrReplaceTempView("messages")
> spark.sql("select count(*) from messages").show()
> {code}
> and I am getting the following exception:
> {code}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.net.ConnectException: Connection refused: 
> worker1.test/x.x.x.x:51029
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> 

[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor

2016-10-15 Thread Vitaly Gerasimov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578539#comment-15578539
 ] 

Vitaly Gerasimov commented on SPARK-17954:
--

I don't think so. Spark 1.6 works fine in this case. Maybe something was 
changed in the Spark 2.0 configuration?

> FetchFailedException executor cannot connect to another worker executor
> ---
>
> Key: SPARK-17954
> URL: https://issues.apache.org/jira/browse/SPARK-17954
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Vitaly Gerasimov
>
> I have a standalone mode Spark cluster which has three nodes:
> master.test
> worker1.test
> worker2.test
> I am trying to run the following code in spark-shell:
> {code}
> val json = spark.read.json("hdfs://master.test/json/a.js.gz", 
> "hdfs://master.test/json/b.js.gz")
> json.createOrReplaceTempView("messages")
> spark.sql("select count(*) from messages").show()
> {code}
> and I am getting the following exception:
> {code}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.net.ConnectException: Connection refused: 
> worker1.test/x.x.x.x:51029
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> 

[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578535#comment-15578535
 ] 

Sean Owen commented on SPARK-650:
-

If you need init to happen ASAP when the driver starts, isn't any similar 
mechanism going to be about the same in this regard? This cost is paid just 
once, and I don't think in general startup is very low latency for any Spark 
app.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread Olivier Armand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578487#comment-15578487
 ] 

Olivier Armand commented on SPARK-650:
--

Sean, a singleton is not the best option in our case. The Spark Streaming 
executors are writing to HBase, so we need to initialize the HBase connection. The 
singleton seems (or seemed when we tested it for our customer a few months 
after this issue was raised) to be created when the first RDD is processed by 
the executor, and not when the driver starts. This imposes a very high processing 
time on the first events.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578378#comment-15578378
 ] 

Sean Owen commented on SPARK-650:
-

Sorry, I mean the _status_ doesn't matter. Most issues this old are obsolete or 
de facto won't-fix. Resolving it or not doesn't matter.

I would even say this is 'not a problem', because a simple singleton provides 
once-per-executor execution of whatever you like. It's more complex to make a 
custom mechanism that makes you route this via Spark. That's probably why this 
hasn't proved necessary.
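
For readers finding this ticket, the singleton approach referred to above is 
roughly the following (a sketch assuming a compiled Scala job with an existing 
SparkContext sc; the object name and the setup body are placeholders):

{code}
object ExecutorSetup {
  // Evaluated at most once per JVM / class loader, the first time an executor touches it.
  lazy val init: Unit = {
    // e.g. configure logging, set correlation IDs, open connection pools ...
  }
}

// Optionally force the setup up front instead of on first real use, e.g. with a throwaway job:
sc.parallelize(1 to sc.defaultParallelism * 4, sc.defaultParallelism * 4).foreachPartition { _ =>
  ExecutorSetup.init
}
{code}

As noted in SPARK-636, a throwaway job like this still does not guarantee that 
every executor runs a task, which is part of why an explicit hook keeps being 
requested.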

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578361#comment-15578361
 ] 

Michael Schmeißer commented on SPARK-650:
-

Then somebody should please explain to me how this doesn't matter, or rather 
how certain use cases are supposed to be solved. We need to initialize each JVM 
and connect it to our logging system, set correlation IDs, initialize contexts 
and so on. I guess that most users have just implemented workarounds as we 
did, but in an enterprise environment, this is really not the preferable 
long-term solution to me. Plus, I think that it would really not be hard to 
implement this feature for someone who has knowledge of the Spark executor 
setup.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578338#comment-15578338
 ] 

Sean Owen commented on SPARK-650:
-

In practice, these should probably all be WontFix as it hasn't mattered enough 
to implement in almost 4 years. It really doesn't matter.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17892:


Assignee: Xiao Li  (was: Apache Spark)

> Query in CTAS is Optimized Twice (branch-2.0)
> -
>
> Key: SPARK-17892
> URL: https://issues.apache.org/jira/browse/SPARK-17892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Blocker
>
> This tracks the work that fixes the problem shown in  
> https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578300#comment-15578300
 ] 

Apache Spark commented on SPARK-17892:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15502

> Query in CTAS is Optimized Twice (branch-2.0)
> -
>
> Key: SPARK-17892
> URL: https://issues.apache.org/jira/browse/SPARK-17892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Blocker
>
> This tracks the work that fixes the problem shown in  
> https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17892:


Assignee: Apache Spark  (was: Xiao Li)

> Query in CTAS is Optimized Twice (branch-2.0)
> -
>
> Key: SPARK-17892
> URL: https://issues.apache.org/jira/browse/SPARK-17892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Blocker
>
> This tracks the work that fixes the problem shown in  
> https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-636) Add mechanism to run system management/configuration tasks on all workers

2016-10-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578294#comment-15578294
 ] 

Michael Schmeißer commented on SPARK-636:
-

I agree; that's why I also feel that these issues are not duplicates. 

> Add mechanism to run system management/configuration tasks on all workers
> -
>
> Key: SPARK-636
> URL: https://issues.apache.org/jira/browse/SPARK-636
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Josh Rosen
>
> It would be useful to have a mechanism to run a task on all workers in order 
> to perform system management tasks, such as purging caches or changing system 
> properties.  This is useful for automated experiments and benchmarking; I 
> don't envision this being used for heavy computation.
> Right now, I can mimic this with something like
> {code}
> sc.parallelize(0 until numMachines, numMachines).foreach { } 
> {code}
> but this does not guarantee that every worker runs a task and requires my 
> user code to know the number of workers.
> One sample use case is setup and teardown for benchmark tests.  For example, 
> I might want to drop cached RDDs, purge shuffle data, and call 
> {{System.gc()}} between test runs.  It makes sense to incorporate some of 
> this functionality, such as dropping cached RDDs, into Spark itself, but it 
> might be helpful to have a general mechanism for running ad-hoc tasks like 
> {{System.gc()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578289#comment-15578289
 ] 

Michael Schmeißer commented on SPARK-650:
-

I disagree that those issues are duplicates. SPARK-636 asks for a generic way 
to execute code on the Executors, but not for a reliable and easy mechanism to 
execute code during Executor initialization.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-10-15 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578200#comment-15578200
 ] 

Ashish Shrowty commented on SPARK-17709:


Oh .. sorry .. I misread. Will try with 2.0.1 later

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using a spark.sql() call, it results in an error:
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-15 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578176#comment-15578176
 ] 

zhangxinyu commented on SPARK-17935:


I will try it on 0.10 in the future. But as far as I know, most users are still 
using Kafka 0.8, since it's stable. So IMO the priority of 0.8 is higher than 
0.10.
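
For context, what users currently hand-roll, and what a built-in KafkaForeachWriter 
would package up, is roughly the following (a sketch against the existing 
ForeachWriter API and the plain Kafka producer client; broker and topic names are 
placeholders):

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

class SimpleKafkaWriter(brokers: String, topic: String) extends ForeachWriter[String] {
  private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(value: String): Unit = {
    producer.send(new ProducerRecord[String, String](topic, value))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
  }
}
{code}

It would be wired up with something like 
ds.writeStream.foreach(new SimpleKafkaWriter("broker1:9092", "results")).start(), 
where ds is a streaming Dataset[String].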

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Spark already supports a Kafka input stream. It would be useful to add a 
> `KafkaForeachWriter` to output results to Kafka in the structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578155#comment-15578155
 ] 

Sean Owen commented on SPARK-17954:
---

Is this not just a networking or IP/hostname problem? It says it can't connect.

> FetchFailedException executor cannot connect to another worker executor
> ---
>
> Key: SPARK-17954
> URL: https://issues.apache.org/jira/browse/SPARK-17954
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Vitaly Gerasimov
>
> I have a standalone mode Spark cluster which has three nodes:
> master.test
> worker1.test
> worker2.test
> I am trying to run the following code in spark-shell:
> {code}
> val json = spark.read.json("hdfs://master.test/json/a.js.gz", 
> "hdfs://master.test/json/b.js.gz")
> json.createOrReplaceTempView("messages")
> spark.sql("select count(*) from messages").show()
> {code}
> and I am getting the following exception:
> {code}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.net.ConnectException: Connection refused: 
> worker1.test/x.x.x.x:51029
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> 

[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-15 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578141#comment-15578141
 ] 

Cody Koeninger commented on SPARK-17935:


Why is this in kafka-0-8, when we haven't resolved (for the third time) whether 
we're even continuing to work on 0.8? Those modules have conflicting 
dependencies as it is.

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Spark already supports a Kafka input stream. It would be useful to add 
> `KafkaForeachWriter` to output results to Kafka from the structured streaming 
> module.
> `KafkaForeachWriter.scala` would be put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster

2016-10-15 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578137#comment-15578137
 ] 

Susan X. Huynh commented on SPARK-11524:


I would like to work on this. What are the missing pieces? What configurations 
need to be tested? [~sunrui][~felixcheung]

> Support SparkR with Mesos cluster
> -
>
> Key: SPARK-11524
> URL: https://issues.apache.org/jira/browse/SPARK-11524
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17956) ProjectExec has incorrect outputOrdering property

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17956:


Assignee: Apache Spark

> ProjectExec has incorrect outputOrdering property
> -
>
> Key: SPARK-17956
> URL: https://issues.apache.org/jira/browse/SPARK-17956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Currently ProjectExec simply takes child plan's outputOrdering as its 
> outputOrdering. In some cases, this leads to incorrect outputOrdering. This 
> applies to TakeOrderedAndProjectExec too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17956) ProjectExec has incorrect outputOrdering property

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17956:


Assignee: (was: Apache Spark)

> ProjectExec has incorrect outputOrdering property
> -
>
> Key: SPARK-17956
> URL: https://issues.apache.org/jira/browse/SPARK-17956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently ProjectExec simply takes child plan's outputOrdering as its 
> outputOrdering. In some cases, this leads to incorrect outputOrdering. This 
> applies to TakeOrderedAndProjectExec too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17956) ProjectExec has incorrect outputOrdering property

2016-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578136#comment-15578136
 ] 

Apache Spark commented on SPARK-17956:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15500

> ProjectExec has incorrect outputOrdering property
> -
>
> Key: SPARK-17956
> URL: https://issues.apache.org/jira/browse/SPARK-17956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently ProjectExec simply takes child plan's outputOrdering as its 
> outputOrdering. In some cases, this leads to incorrect outputOrdering. This 
> applies to TakeOrderedAndProjectExec too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17938) Backpressure rate not adjusting

2016-10-15 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578133#comment-15578133
 ] 

Cody Koeninger commented on SPARK-17938:


There was a pretty extensive discussion of this on the list; it should be linked 
or summarized here.

Couple of things here:

100 is the default minimum rate for the PID rate estimator. If you're willing to 
write code, add more logging to determine why that rate isn't being configured, 
or hardcode it to a different number. I have successfully adjusted that rate 
using Spark configuration.

The other thing is that if your system takes way longer than 1 second to process 
100k records, 100k obviously isn't a reasonable max. Many large batches will be 
defined during the time that first batch is running, before backpressure is 
involved at all. Try a lower max.
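
For reference, a sketch of what that configuration could look like. The config 
keys below are from the streaming/Kafka integration; the concrete values are 
placeholders, not recommendations:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable backpressure, cap the first (unthrottled) batches with a lower
// per-partition max, and adjust the PID estimator's minimum rate
// (spark.streaming.backpressure.pid.minRate defaults to 100).
val conf = new SparkConf()
  .setAppName("backpressure-example")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  .set("spark.streaming.backpressure.pid.minRate", "10")

val ssc = new StreamingContext(conf, Seconds(1))
{code}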

> Backpressure rate not adjusting
> ---
>
> Key: SPARK-17938
> URL: https://issues.apache.org/jira/browse/SPARK-17938
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Samy Dindane
>
> spark-streaming 2.0.1 and spark-streaming-kafka-0-10 version is 2.0.1. Same 
> behavior with 2.0.0 though.
> spark.streaming.kafka.consumer.poll.ms is set to 3
> spark.streaming.kafka.maxRatePerPartition is set to 10
> spark.streaming.backpressure.enabled is set to true
> `batchDuration` of the streaming context is set to 1 second.
> I consume a Kafka topic using KafkaUtils.createDirectStream().
> My system can handle batches of 100k records, but it takes more than 1 second 
> to process them all. I'd thus expect backpressure to reduce the number of 
> records fetched in the next batch, to keep the processing delay below 1 second.
> Only this does not happen, and the backpressure rate stays the same: stuck at 
> `100.0`, no matter how the other variables change (processing time, error, 
> etc.).
> Here's a log showing how all these variables change while the chosen rate stays 
> the same: https://gist.github.com/Dinduks/d9fa67fc8a036d3cad8e859c508acdba (I 
> would have attached a file but I don't see how).
> Is this the expected behavior and am I missing something, or is this a bug?
> I'll gladly help by providing more information or writing code if necessary.
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17956) ProjectExec has incorrect outputOrdering property

2016-10-15 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-17956:
---

 Summary: ProjectExec has incorrect outputOrdering property
 Key: SPARK-17956
 URL: https://issues.apache.org/jira/browse/SPARK-17956
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


Currently ProjectExec simply takes child plan's outputOrdering as its 
outputOrdering. In some cases, this leads to incorrect outputOrdering. This 
applies to TakeOrderedAndProjectExec too.
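
An illustrative sketch of the shape of plan involved (the exact failing case is 
in the pull request; this only shows a projection sitting on top of a sort, 
where the child's ordering no longer describes the projected output):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// The SortExec child is ordered by "key", but the projection only outputs an
// expression derived from it, so simply inheriting the child's outputOrdering
// is not correct.
val df = Seq((3, "c"), (1, "a"), (2, "b")).toDF("key", "value")
val projected = df.sort(col("key")).select((col("key") * -1).as("negKey"))
projected.explain(true)
{code}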



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17838) Strict type checking for arguments with a better messages across APIs.

2016-10-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578101#comment-15578101
 ] 

Hyukjin Kwon commented on SPARK-17838:
--

I am trying to work on this, but I don't mind if anyone takes it over. It seems 
it will take a bit of time.

> Strict type checking for arguments with a better messages across APIs.
> --
>
> Key: SPARK-17838
> URL: https://issues.apache.org/jira/browse/SPARK-17838
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Hyukjin Kwon
>
> It seems there should be more strict type checking for arguments in SparkR 
> APIs. This was discussed in several PRs. 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> Roughly it seems there are three cases as below:
> The first case below was described in 
> https://github.com/apache/spark/pull/15239#discussion_r82445435
> - Check for {{zero-length variable name}}
> Some of the other cases below were handled in 
> https://github.com/apache/spark/pull/15231#discussion_r80417904
> - Catch the exception from the JVM and format it nicely
> - Check types strictly before calling into the JVM from SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578098#comment-15578098
 ] 

Hyukjin Kwon commented on SPARK-17878:
--

Yes, this might not be a reason to block this feature.
It doesn't seem easy and would need a lot of changes. How about, for example, 
adding {{option(value: List[String])}} and then converting it into a JSON array?
Actually, I can move this into another JIRA, as it might not be a reason to 
block this one, but I just want to hear your thoughts before I try.
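
Until something like that exists, a workaround sketch for reading data with two 
null sentinels today (the column name, sentinel values and path are assumptions 
for illustration):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder().getOrCreate()

// Treat "-" as null at read time via the existing nullValue option, then map
// the remaining sentinel "*" to null in a second pass.
val raw = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")
  .load("/path/to/data.csv")

val cleaned = raw.withColumn("price",
  when(col("price") === "*", lit(null)).otherwise(col("price")))
{code}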

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can be easily implemented in a 
> backwards compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way a user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17945) Writing to S3 should allow setting object metadata

2016-10-15 Thread Jeff Schobelock (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578064#comment-15578064
 ] 

Jeff Schobelock commented on SPARK-17945:
-

That's true enough. The use case on my end is that we use S3 object metadata to 
get things like file delimiters, etc., when we process files.

> Writing to S3 should allow setting object metadata
> --
>
> Key: SPARK-17945
> URL: https://issues.apache.org/jira/browse/SPARK-17945
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Jeff Schobelock
>Priority: Minor
>
> I can't find any possible way to use Spark to write to S3 and set user object 
> metadata. This seems like such a simple thing that I feel I must be missing 
> how to do it somewhere, but I have yet to find anything.
> I don't know what work adding this would entail. My idea would be that 
> there is something like:
> rdd.saveAsTextFile(s3://testbucket/file).withMetadata(Map 
> data).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17955) Use the same read path in DataFrameReader.jdbc and DataFrameReader.format("jdbc")

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17955:


Assignee: Apache Spark

> Use the same read path in DataFrameReader.jdbc and 
> DataFrameReader.format("jdbc") 
> --
>
> Key: SPARK-17955
> URL: https://issues.apache.org/jira/browse/SPARK-17955
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> It seems the APIs in {{DataFrameReader}}/{{DataFrameWriter}} share the 
> {{format("...").load()}} or {{format("...").save()}} paths for 
> {{json(...)}}/{{csv(...)}}, etc.
> We can share this within {{DataFrameReader.jdbc(...)}} too, consistently with 
> the other APIs.
> {code}
> -// connectionProperties should override settings in extraOptions.
> -val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
> -val options = new JDBCOptions(url, table, params)
> -val relation = JDBCRelation(parts, options)(sparkSession)
> -sparkSession.baseRelationToDataFrame(relation)
> +// connectionProperties should override settings in extraOptions
> +this.extraOptions = this.extraOptions ++ (connectionProperties.asScala)
> +// explicit url and dbtable should override all
> +this.extraOptions += ("url" -> url, "dbtable" -> table)
> +format("jdbc").load()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17955) Use the same read path in DataFrameReader.jdbc and DataFrameReader.format("jdbc")

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17955:


Assignee: (was: Apache Spark)

> Use the same read path in DataFrameReader.jdbc and 
> DataFrameReader.format("jdbc") 
> --
>
> Key: SPARK-17955
> URL: https://issues.apache.org/jira/browse/SPARK-17955
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> It seems the APIs in {{DataFrameReader}}/{{DataFrameWriter}} share the 
> {{format("...").load()}} or {{format("...").save()}} paths for 
> {{json(...)}}/{{csv(...)}}, etc.
> We can share this within {{DataFrameReader.jdbc(...)}} too, consistently with 
> the other APIs.
> {code}
> -// connectionProperties should override settings in extraOptions.
> -val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
> -val options = new JDBCOptions(url, table, params)
> -val relation = JDBCRelation(parts, options)(sparkSession)
> -sparkSession.baseRelationToDataFrame(relation)
> +// connectionProperties should override settings in extraOptions
> +this.extraOptions = this.extraOptions ++ (connectionProperties.asScala)
> +// explicit url and dbtable should override all
> +this.extraOptions += ("url" -> url, "dbtable" -> table)
> +format("jdbc").load()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17955) Use the same read path in DataFrameReader.jdbc and DataFrameReader.format("jdbc")

2016-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578038#comment-15578038
 ] 

Apache Spark commented on SPARK-17955:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15499

> Use the same read path in DataFrameReader.jdbc and 
> DataFrameReader.format("jdbc") 
> --
>
> Key: SPARK-17955
> URL: https://issues.apache.org/jira/browse/SPARK-17955
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> It seems the APIs in {{DataFrameReader}}/{{DataFrameWriter}} share the 
> {{format("...").load()}} or {{format("...").save()}} paths for 
> {{json(...)}}/{{csv(...)}}, etc.
> We can share this within {{DataFrameReader.jdbc(...)}} too, consistently with 
> the other APIs.
> {code}
> -// connectionProperties should override settings in extraOptions.
> -val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
> -val options = new JDBCOptions(url, table, params)
> -val relation = JDBCRelation(parts, options)(sparkSession)
> -sparkSession.baseRelationToDataFrame(relation)
> +// connectionProperties should override settings in extraOptions
> +this.extraOptions = this.extraOptions ++ (connectionProperties.asScala)
> +// explicit url and dbtable should override all
> +this.extraOptions += ("url" -> url, "dbtable" -> table)
> +format("jdbc").load()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17955) Use the same read path in DataFrameReader.jdbc and DataFrameReader.format("jdbc")

2016-10-15 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-17955:


 Summary: Use the same read path in DataFrameReader.jdbc and 
DataFrameReader.format("jdbc") 
 Key: SPARK-17955
 URL: https://issues.apache.org/jira/browse/SPARK-17955
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon
Priority: Trivial


It seems the APIs in {{DataFrameReader}}/{{DataFrameWriter}} share the 
{{format("...").load()}} or {{format("...").save()}} paths for 
{{json(...)}}/{{csv(...)}}, etc.

We can share this within {{DataFrameReader.jdbc(...)}} too, consistently with 
the other APIs.

{code}
-// connectionProperties should override settings in extraOptions.
-val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
-val options = new JDBCOptions(url, table, params)
-val relation = JDBCRelation(parts, options)(sparkSession)
-sparkSession.baseRelationToDataFrame(relation)
+// connectionProperties should override settings in extraOptions
+this.extraOptions = this.extraOptions ++ (connectionProperties.asScala)
+// explicit url and dbtable should override all
+this.extraOptions += ("url" -> url, "dbtable" -> table)
+format("jdbc").load()
{code}
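
For reference, a minimal sketch of the generic path this change consolidates 
onto (the connection details are placeholders):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Both DataFrameReader.jdbc(...) and the generic format("jdbc") call should
// end up building the same JDBC relation.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "schema.table")
  .option("user", "username")
  .option("password", "password")
  .load()
{code}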




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14428) [SQL] Allow more flexibility when parsing dates and timestamps in json datasources

2016-10-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577994#comment-15577994
 ] 

Hyukjin Kwon commented on SPARK-14428:
--

For 1., I guess this was fixed in https://github.com/apache/spark/pull/14279, so 
we should define the format for read/write.
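
For example, assuming a Spark version where the JSON source exposes these 
options (the path and patterns below are placeholders):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Per-source formats for dates and timestamps when reading JSON.
val df = spark.read
  .option("dateFormat", "yyyy/MM/dd")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
  .json("/path/to/data.json")
{code}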

> [SQL] Allow more flexibility when parsing dates and timestamps in json 
> datasources
> --
>
> Key: SPARK-14428
> URL: https://issues.apache.org/jira/browse/SPARK-14428
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: date, features, json, timestamp
>
> Reading JSON with dates and timestamps is limited to predetermined string 
> formats or long values.
> 1) It should be possible to set an option on the JSON data source to parse 
> dates and timestamps using a custom string format.
> 2) It should be possible to change the interpretation of long values since the 
> epoch. It could support different precisions like days, seconds, milliseconds, 
> microseconds and nanoseconds.
> Something along the lines of:
> {code}
> object Precision extends Enumeration {
> val days, seconds, milliseconds, microseconds, nanoseconds = Value
>   }
> def convertWithPrecision(time: Long, from: Precision.Value, to: 
> Precision.Value): Long = ...
> ...
>   val dateFormat = parameters.getOrElse("dateFormat", "").trim
>   val timestampFormat = parameters.getOrElse("timestampFormat", "").trim
>   val longDatePrecision = parameters.getOrElse("longDatePrecision", "days")
>   val longTimestampPrecision = parameters.getOrElse("longTimestampPrecision", 
> "milliseconds")
> {code}
> and 
> {code}
>   case (VALUE_STRING, DateType) =>
> val stringValue = parser.getText
> val days = if (configOptions.dateFormat.nonEmpty) {
>   // User defined format, make sure it complies to the SQL DATE 
> format (number of days)
>   val sdf = new SimpleDateFormat(configOptions.dateFormat) // Not 
> thread safe.
>   DateTimeUtils.convertWithPrecision(sdf.parse(stringValue).getTime, 
> Precision.milliseconds, Precision.days)
> } else if (stringValue.forall(_.isDigit)) {
>   DateTimeUtils.convertWithPrecision(stringValue.toLong, 
> configOptions.longDatePrecision, Precision.days)
> } else {
>   // The format of this string will probably be "yyyy-mm-dd".
>   
> DateTimeUtils.convertWithPrecision(DateTimeUtils.stringToTime(parser.getText).getTime,
>  Precision.milliseconds, Precision.days)
> }
> days.toInt
>   case (VALUE_NUMBER_INT, DateType) =>
>   DateTimeUtils.convertWithPrecision(parser.getLongValue, 
> configOptions.longDatePrecision, Precision.days).toInt
> {code}
> With similar handling for Timestamps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor

2016-10-15 Thread Vitaly Gerasimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaly Gerasimov updated SPARK-17954:
-
Issue Type: Bug  (was: Question)

> FetchFailedException executor cannot connect to another worker executor
> ---
>
> Key: SPARK-17954
> URL: https://issues.apache.org/jira/browse/SPARK-17954
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Vitaly Gerasimov
>
> I have a standalone mode Spark cluster which has three nodes:
> master.test
> worker1.test
> worker2.test
> I am trying to run the following code in the spark shell:
> {code}
> val json = spark.read.json("hdfs://master.test/json/a.js.gz", 
> "hdfs://master.test/json/b.js.gz")
> json.createOrReplaceTempView("messages")
> spark.sql("select count(*) from messages").show()
> {code}
> and I am getting the following exception:
> {code}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.net.ConnectException: Connection refused: 
> worker1.test/x.x.x.x:51029
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>   at 
> 

[jira] [Created] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor

2016-10-15 Thread Vitaly Gerasimov (JIRA)
Vitaly Gerasimov created SPARK-17954:


 Summary: FetchFailedException executor cannot connect to another 
worker executor
 Key: SPARK-17954
 URL: https://issues.apache.org/jira/browse/SPARK-17954
 Project: Spark
  Issue Type: Question
Affects Versions: 2.0.1, 2.0.0
Reporter: Vitaly Gerasimov


I have a standalone mode Spark cluster which has three nodes:
master.test
worker1.test
worker2.test

I am trying to run the following code in the spark shell:
{code}
val json = spark.read.json("hdfs://master.test/json/a.js.gz", 
"hdfs://master.test/json/b.js.gz")
json.createOrReplaceTempView("messages")
spark.sql("select count(*) from messages").show()
{code}

and I am getting the following exception:
{code}
org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
worker1.test/x.x.x.x:51029
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to worker1.test/x.x.x.x:51029
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.net.ConnectException: Connection refused: 
worker1.test/x.x.x.x:51029
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 

[jira] [Commented] (SPARK-17930) The SerializerInstance instance used when deserializing a TaskResult is not reused

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577852#comment-15577852
 ] 

Sean Owen commented on SPARK-17930:
---

Maybe so; I think the question is whether this would cause it to be used by 
multiple threads at once. Do you have any rough benchmarks that suggest this is 
a bottleneck?
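
One possible direction, sketched here only to make the trade-off concrete: cache 
a SerializerInstance per thread, so repeated deserialization avoids the cost of 
newInstance() while no single instance is ever shared across threads. This is an 
assumption about a fix, not the proposed patch:

{code}
import org.apache.spark.SparkEnv
import org.apache.spark.serializer.SerializerInstance

// Per-thread cache of a SerializerInstance for deserializing task results.
object TaskResultSerializers {
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance =
      SparkEnv.get.serializer.newInstance()
  }

  def get(): SerializerInstance = cached.get()
}
{code}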

> The SerializerInstance instance used when deserializing a TaskResult is not 
> reused 
> ---
>
> Key: SPARK-17930
> URL: https://issues.apache.org/jira/browse/SPARK-17930
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
>Reporter: Guoqiang Li
>
> The following code is called when the DirectTaskResult instance is 
> deserialized
> {noformat}
>   def value(): T = {
> if (valueObjectDeserialized) {
>   valueObject
> } else {
>   // Each deserialization creates a new instance of SerializerInstance, 
> which is very time-consuming
>   val resultSer = SparkEnv.get.serializer.newInstance()
>   valueObject = resultSer.deserialize(valueBytes)
>   valueObjectDeserialized = true
>   valueObject
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17936) "CodeGenerator - failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of" method Error

2016-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17936.
---
   Resolution: Fixed
 Assignee: Takuya Ueshin
Fix Version/s: 2.1.0

OK, if you're pretty confident that also happened to fix this issue, that's 
great. It may even resolve the other JIRA I linked.

Evidently resolved by https://github.com/apache/spark/pull/15275

> "CodeGenerator - failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of" method Error
> -
>
> Key: SPARK-17936
> URL: https://issues.apache.org/jira/browse/SPARK-17936
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Justin Miller
>Assignee: Takuya Ueshin
> Fix For: 2.1.0
>
>
> Greetings. I'm currently in the process of migrating a project I'm working on 
> from Spark 1.6.2 to 2.0.1. The project uses Spark Streaming to convert Thrift 
> structs coming from Kafka into Parquet files stored in S3. This conversion 
> process works fine in 1.6.2 but I think there may be a bug in 2.0.1. I'll 
> paste the stack trace below.
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass;[Ljava/lang/Object;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
>   at org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
>   at org.codehaus.janino.UnitCompiler.writeLdc(UnitCompiler.java:9058)
> Also, later on:
> 07:35:30.191 ERROR o.a.s.u.SparkUncaughtExceptionHandler - Uncaught exception 
> in thread Thread[Executor task launch worker-6,5,run-main-group-0]
> java.lang.OutOfMemoryError: Java heap space
> I've seen similar issues posted, but those were always on the query side. I 
> have a hunch that this is happening at write time as the error occurs after 
> batchDuration. Here's the write snippet.
> stream.
>   flatMap {
> case Success(row) =>
>   thriftParseSuccess += 1
>   Some(row)
> case Failure(ex) =>
>   thriftParseErrors += 1
>   logger.error("Error during deserialization: ", ex)
>   None
>   }.foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.context)
> transformer(sqlContext.createDataFrame(rdd, converter.schema))
>   .coalesce(coalesceSize)
>   .write
>   .mode(Append)
>   .partitionBy(partitioning: _*)
>   .parquet(parquetPath)
>   }
> Please let me know if you can be of assistance and if there's anything I can 
> do to help.
> Best,
> Justin



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14887) Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577832#comment-15577832
 ] 

Sean Owen commented on SPARK-14887:
---

Possibly resolved by https://issues.apache.org/jira/browse/SPARK-17702 given 
discussion in https://issues.apache.org/jira/browse/SPARK-17936 but I'm not 
sure.

> Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits
> ---
>
> Key: SPARK-14887
> URL: https://issues.apache.org/jira/browse/SPARK-14887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: fang fang chen
>Assignee: Davies Liu
>
> Similar issue to SPARK-14138 and SPARK-8443:
> With a large SQL statement (673K), the following error happened:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17944) sbin/start-* scripts use of `hostname -f` fail with Solaris

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577826#comment-15577826
 ] 

Sean Owen commented on SPARK-17944:
---

Yes, for example `hostname` differs on OS X / Linux but the behavior of `-f` 
happens to give the same desired output on both. Is `check-hostname` present 
only on Solaris (or at least, does it do the right thing everywhere we think it 
exists)? If so I could imagine just checking if that command exists and using 
it if present.

There are 3 usages, and all could have the same treatment. That's not so bad 
IMHO. A central script to switch out commands seems appealing but the 
indirection non-trivially complicates the scripts. I'm not sure how far to go 
to try to support Solaris, not because there's anything wrong with it, but 
because a) I don't know how many Spark users are on Solaris, and b) I am not 
sure we know it otherwise works, that there aren't other problems like this.

> sbin/start-* scripts use of `hostname -f` fail with Solaris 
> 
>
> Key: SPARK-17944
> URL: https://issues.apache.org/jira/browse/SPARK-17944
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Solaris 10, Solaris 11
>Reporter: Erik O'Shaughnessy
>Priority: Trivial
>
> {{$SPARK_HOME/sbin/start-master.sh}} fails:
> {noformat}
> $ ./start-master.sh 
> usage: hostname [[-t] system_name]
>hostname [-D]
> starting org.apache.spark.deploy.master.Master, logging to 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> failed to launch org.apache.spark.deploy.master.Master:
> --properties-file FILE Path to a custom Spark properties file.
>Default is conf/spark-defaults.conf.
> full log in 
> /home/eoshaugh/local/spark/logs/spark-eoshaugh-org.apache.spark.deploy.master.Master-1-m7-16-002-ld1.out
> {noformat}
> I found SPARK-17546 which changed the invocation of hostname in 
> sbin/start-master.sh, sbin/start-slaves.sh and sbin/start-mesos-dispatcher.sh 
> to include the flag {{-f}}, which is not a valid command line option for the 
> Solaris hostname implementation. 
> As a workaround, Solaris users can substitute:
> {noformat}
> `/usr/sbin/check-hostname | awk '{print $NF}'`
> {noformat}
> Admittedly not an obvious fix, but it provides equivalent functionality. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577637#comment-15577637
 ] 

Sean Owen commented on SPARK-17951:
---

I'm not sure this suggests a problem with Spark. It's a very small difference 
in absolute terms and may say more about the benchmark than Spark. It's also an 
internal API. Do you have a bigger difference in calls to a user-facing API?

> BlockFetch with multiple threads slows down after spark 1.6
> ---
>
> Key: SPARK-17951
> URL: https://issues.apache.org/jira/browse/SPARK-17951
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.2
> Environment: cluster with 8 node, each node has 28 cores. 10Gb network
>Reporter: ding
>
> The following code demonstrates the issue:
> {code}
> def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName(s"BMTest")
> val size = 3344570
> val sc = new SparkContext(conf)
> val data = sc.parallelize(1 to 100, 8)
> var accum = sc.accumulator(0.0, "get remote bytes")
> var i = 0
> while(i < 91) {
>   accum = sc.accumulator(0.0, "get remote bytes")
>   val test = data.mapPartitionsWithIndex { (pid, iter) =>
> val N = size
> val bm = SparkEnv.get.blockManager
> val blockId = TaskResultBlockId(10*i + pid)
> val test = new Array[Byte](N)
> Random.nextBytes(test)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> buffer.put(test)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
> Iterator(1)
>   }.count()
>   
>   data.mapPartitionsWithIndex { (pid, iter) =>
> val before = System.nanoTime()
> 
> val bm = SparkEnv.get.blockManager
> (0 to 7).map(s => {
>   Future {
> val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s))
>   }
> }).map(Await.result(_, Duration.Inf))
> 
> accum.add((System.nanoTime() - before) / 1e9)
> Iterator(1)
>   }.count()
>   println("get remote bytes take: " + accum.value/8)
>   i += 1
> }
>   }
> {code}
> In Spark 1.6.2, the average "get remote bytes" time is 0.19 s, while
> in Spark 1.5.1 the average "get remote bytes" time is 0.09 s.
> However, if blocks are fetched in a single thread, the gap is much smaller:
> Spark 1.6.2  get remote bytes: 0.21 s
> Spark 1.5.1  get remote bytes: 0.20 s



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17951) BlockFetch with multiple threads slows down after spark 1.6

2016-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17951:
--
Description: 
The following code demonstrates the issue:

{code}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName(s"BMTest")
val size = 3344570
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 100, 8)
var accum = sc.accumulator(0.0, "get remote bytes")
var i = 0
while(i < 91) {
  accum = sc.accumulator(0.0, "get remote bytes")
  val test = data.mapPartitionsWithIndex { (pid, iter) =>
val N = size
val bm = SparkEnv.get.blockManager
val blockId = TaskResultBlockId(10*i + pid)
val test = new Array[Byte](N)
Random.nextBytes(test)
val buffer = ByteBuffer.allocate(N)
buffer.limit(N)
buffer.put(test)
bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
Iterator(1)
  }.count()
  
  data.mapPartitionsWithIndex { (pid, iter) =>
val before = System.nanoTime()

val bm = SparkEnv.get.blockManager
(0 to 7).map(s => {
  Future {
val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s))
  }
}).map(Await.result(_, Duration.Inf))

accum.add((System.nanoTime() - before) / 1e9)
Iterator(1)
  }.count()
  println("get remote bytes take: " + accum.value/8)
  i += 1
}
  }
{code}


In Spark 1.6.2, the average "get remote bytes" time is 0.19 s, while
in Spark 1.5.1 the average "get remote bytes" time is 0.09 s.

However, if blocks are fetched in a single thread, the gap is much smaller:
Spark 1.6.2  get remote bytes: 0.21 s
Spark 1.5.1  get remote bytes: 0.20 s



  was:
The following code demonstrates the issue:
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName(s"BMTest")
val size = 3344570
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 100, 8)
var accum = sc.accumulator(0.0, "get remote bytes")
var i = 0
while(i < 91) {
  accum = sc.accumulator(0.0, "get remote bytes")
  val test = data.mapPartitionsWithIndex { (pid, iter) =>
val N = size
val bm = SparkEnv.get.blockManager
val blockId = TaskResultBlockId(10*i + pid)
val test = new Array[Byte](N)
Random.nextBytes(test)
val buffer = ByteBuffer.allocate(N)
buffer.limit(N)
buffer.put(test)
bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
Iterator(1)
  }.count()
  
  data.mapPartitionsWithIndex { (pid, iter) =>
val before = System.nanoTime()

val bm = SparkEnv.get.blockManager
(0 to 7).map(s => {
  Future {
val result = bm.getRemoteBytes(TaskResultBlockId(10*i + s))
  }
}).map(Await.result(_, Duration.Inf))

accum.add((System.nanoTime() - before) / 1e9)
Iterator(1)
  }.count()
  println("get remote bytes take: " + accum.value/8)
  i += 1
}
  }

In Spark 1.6.2, the average "get remote bytes" time is 0.19 s, while
in Spark 1.5.1 the average "get remote bytes" time is 0.09 s.

However, if blocks are fetched in a single thread, the gap is much smaller:
Spark 1.6.2  get remote bytes: 0.21 s
Spark 1.5.1  get remote bytes: 0.20 s




> BlockFetch with multiple threads slows down after spark 1.6
> ---
>
> Key: SPARK-17951
> URL: https://issues.apache.org/jira/browse/SPARK-17951
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.2
> Environment: cluster with 8 node, each node has 28 cores. 10Gb network
>Reporter: ding
>
> The following code demonstrates the issue:
> {code}
> def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName(s"BMTest")
> val size = 3344570
> val sc = new SparkContext(conf)
> val data = sc.parallelize(1 to 100, 8)
> var accum = sc.accumulator(0.0, "get remote bytes")
> var i = 0
> while(i < 91) {
>   accum = sc.accumulator(0.0, "get remote bytes")
>   val test = data.mapPartitionsWithIndex { (pid, iter) =>
> val N = size
> val bm = SparkEnv.get.blockManager
> val blockId = TaskResultBlockId(10*i + pid)
> val test = new Array[Byte](N)
> Random.nextBytes(test)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> buffer.put(test)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_ONLY_SER)
> Iterator(1)
>   }.count()
>   
>   data.mapPartitionsWithIndex { (pid, iter) =>
> val before = System.nanoTime()
> 
>  

[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16845:

Target Version/s: 2.1.0

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17953) Fix typo in SparkSession scaladoc

2016-10-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17953:
--
Priority: Trivial  (was: Minor)

> Fix typo in SparkSession scaladoc
> -
>
> Key: SPARK-17953
> URL: https://issues.apache.org/jira/browse/SPARK-17953
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Tae Jun Kim
>Assignee: Tae Jun Kim
>Priority: Trivial
> Fix For: 2.0.2, 2.1.0
>
>
> At SparkSession builder example,
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value").
>  .getOrCreate()
> {code}
> should be
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value")
>  .getOrCreate()
> {code}
> There was one dot at the end of the third line
> {code}
>  .config("spark.some.config.option", "some-value").
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17950) Match SparseVector behavior with DenseVector

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577596#comment-15577596
 ] 

Sean Owen commented on SPARK-17950:
---

Doesn't that make a potentially huge array every time this is called?

> Match SparseVector behavior with DenseVector
> 
>
> Key: SPARK-17950
> URL: https://issues.apache.org/jira/browse/SPARK-17950
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.1
>Reporter: AbderRahman Sobh
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Simply added the `__getattr__` that DenseVector has to SparseVector, but it 
> calls self.toArray() instead of storing an array all the time in self.array.
> This allows numpy functions to be used on the values of a SparseVector in the 
> same direct way that users interact with DenseVectors,
> i.e. you can simply call SparseVector.mean() to average the values in the 
> entire vector.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17953) Fix typo in SparkSession scaladoc

2016-10-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17953.
-
   Resolution: Fixed
 Assignee: Tae Jun Kim
Fix Version/s: 2.1.0
   2.0.2

> Fix typo in SparkSession scaladoc
> -
>
> Key: SPARK-17953
> URL: https://issues.apache.org/jira/browse/SPARK-17953
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Tae Jun Kim
>Assignee: Tae Jun Kim
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> At SparkSession builder example,
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value").
>  .getOrCreate()
> {code}
> should be
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value")
>  .getOrCreate()
> {code}
> There was one dot at the end of the third line
> {code}
>  .config("spark.some.config.option", "some-value").
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17945) Writing to S3 should allow setting object metadata

2016-10-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577593#comment-15577593
 ] 

Sean Owen commented on SPARK-17945:
---

You can just set this with the S3 APIs directly? This borders on something that's 
not worth Spark passing through just for S3, unless there's a common use case 
for it and possibly other FSes.
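
A rough sketch of that approach with the AWS SDK for Java v1, applied after the 
Spark write completes. The bucket, key and metadata values are placeholders, and 
it assumes default credentials; since S3 object metadata is immutable, it is set 
via an in-place copy:

{code}
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.{CopyObjectRequest, ObjectMetadata}

val s3 = new AmazonS3Client()
val bucket = "testbucket"
val key = "file/part-00000"

// Read the current metadata, add a user key, and copy the object onto itself
// with the new metadata.
val metadata: ObjectMetadata = s3.getObjectMetadata(bucket, key)
metadata.addUserMetadata("delimiter", ",")

val copy = new CopyObjectRequest(bucket, key, bucket, key)
  .withNewObjectMetadata(metadata)
s3.copyObject(copy)
{code}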

> Writing to S3 should allow setting object metadata
> --
>
> Key: SPARK-17945
> URL: https://issues.apache.org/jira/browse/SPARK-17945
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Jeff Schobelock
>Priority: Minor
>
> I can't find any possible way to use Spark to write to S3 and set user object 
> metadata. This seems like such a simple thing that I feel I must be missing 
> how to do it somewhere, but I have yet to find anything.
> I don't know what work adding this would entail. My idea would be that 
> there is something like:
> rdd.saveAsTextFile(s3://testbucket/file).withMetadata(Map 
> data).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17953) Fix typo in SparkSession scaladoc

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17953:


Assignee: Apache Spark

> Fix typo in SparkSession scaladoc
> -
>
> Key: SPARK-17953
> URL: https://issues.apache.org/jira/browse/SPARK-17953
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Tae Jun Kim
>Assignee: Apache Spark
>Priority: Minor
>
> In the SparkSession builder example,
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value").
>  .getOrCreate()
> {code}
> should be
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value")
>  .getOrCreate()
> {code}
> There is an extra dot at the end of the third line:
> {code}
>  .config("spark.some.config.option", "some-value").
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17953) Fix typo in SparkSession scaladoc

2016-10-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577582#comment-15577582
 ] 

Apache Spark commented on SPARK-17953:
--

User 'tae-jun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15498

> Fix typo in SparkSession scaladoc
> -
>
> Key: SPARK-17953
> URL: https://issues.apache.org/jira/browse/SPARK-17953
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Tae Jun Kim
>Priority: Minor
>
> In the SparkSession builder example,
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value").
>  .getOrCreate()
> {code}
> should be
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value")
>  .getOrCreate()
> {code}
> There is an extra dot at the end of the third line:
> {code}
>  .config("spark.some.config.option", "some-value").
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17953) Fix typo in SparkSession scaladoc

2016-10-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17953:


Assignee: (was: Apache Spark)

> Fix typo in SparkSession scaladoc
> -
>
> Key: SPARK-17953
> URL: https://issues.apache.org/jira/browse/SPARK-17953
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Tae Jun Kim
>Priority: Minor
>
> In the SparkSession builder example,
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value").
>  .getOrCreate()
> {code}
> should be
> {code}
> SparkSession.builder()
>  .master("local")
>  .appName("Word Count")
>  .config("spark.some.config.option", "some-value")
>  .getOrCreate()
> {code}
> There is an extra dot at the end of the third line:
> {code}
>  .config("spark.some.config.option", "some-value").
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception with nested Javabean

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Summary: Java SparkSession createDataFrame method throws exception with 
nested Javabean  (was: Java SparkSession createDataFrame doesn't work with 
nested Javabean)

> Java SparkSession createDataFrame method throws exception with nested Javabean
> --
>
> Key: SPARK-17952
> URL: https://issues.apache.org/jira/browse/SPARK-17952
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Amit Baghel
>
> As per latest spark documentation for Java at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
>  createDataFrame method supports nested JavaBeans. However this is not 
> working. Please see below code.
> SubCategory class
> {code}
> public class SubCategory implements Serializable{
>   private String id;
>   private String name;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public String getName() {
>   return name;
>   }
>   public void setName(String name) {
>   this.name = name;
>   }   
> }
> {code}
> Category class
> {code}
> public class Category implements Serializable{
>   private String id;
>   private SubCategory subCategory;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public SubCategory getSubCategory() {
>   return subCategory;
>   }
>   public void setSubCategory(SubCategory subCategory) {
>   this.subCategory = subCategory;
>   }
> }
> {code}
> SparkSample class
> {code}
> public class SparkSample {
>   public static void main(String[] args) throws IOException { 
> 
>   SparkSession spark = SparkSession
>   .builder()
>   .appName("SparkSample")
>   .master("local")
>   .getOrCreate();
>   //SubCategory
>   SubCategory sub = new SubCategory();
>   sub.setId("sc-111");
>   sub.setName("Sub-1");
>   //Category
>   Category category = new Category();
>   category.setId("s-111");
>   category.setSubCategory(sub);
>   //categoryList
>   List<Category> categoryList = new ArrayList<Category>();
>   categoryList.add(category);
>//DF
>   Dataset<Row> dframe = spark.createDataFrame(categoryList, 
> Category.class);  
>   dframe.show();  
>   }
> }
> {code}
> Above code throws below error.
> {code}
> Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d 
> (of class com.sample.SubCategory)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
>   at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
>   at 
> 

[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame method throws exception for nested JavaBeans

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Summary: Java SparkSession createDataFrame method throws exception for 
nested JavaBeans  (was: Java SparkSession createDataFrame method throws 
exception with nested Javabean)

> Java SparkSession createDataFrame method throws exception for nested JavaBeans
> --
>
> Key: SPARK-17952
> URL: https://issues.apache.org/jira/browse/SPARK-17952
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Amit Baghel
>
> As per latest spark documentation for Java at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
>  createDataFrame method supports nested JavaBeans. However this is not 
> working. Please see below code.
> SubCategory class
> {code}
> public class SubCategory implements Serializable{
>   private String id;
>   private String name;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public String getName() {
>   return name;
>   }
>   public void setName(String name) {
>   this.name = name;
>   }   
> }
> {code}
> Category class
> {code}
> public class Category implements Serializable{
>   private String id;
>   private SubCategory subCategory;
>   
>   public String getId() {
>   return id;
>   }
>   public void setId(String id) {
>   this.id = id;
>   }
>   public SubCategory getSubCategory() {
>   return subCategory;
>   }
>   public void setSubCategory(SubCategory subCategory) {
>   this.subCategory = subCategory;
>   }
> }
> {code}
> SparkSample class
> {code}
> public class SparkSample {
>   public static void main(String[] args) throws IOException { 
> 
>   SparkSession spark = SparkSession
>   .builder()
>   .appName("SparkSample")
>   .master("local")
>   .getOrCreate();
>   //SubCategory
>   SubCategory sub = new SubCategory();
>   sub.setId("sc-111");
>   sub.setName("Sub-1");
>   //Category
>   Category category = new Category();
>   category.setId("s-111");
>   category.setSubCategory(sub);
>   //categoryList
>   List<Category> categoryList = new ArrayList<Category>();
>   categoryList.add(category);
>//DF
>   Dataset<Row> dframe = spark.createDataFrame(categoryList, 
> Category.class);  
>   dframe.show();  
>   }
> }
> {code}
> Above code throws below error.
> {code}
> Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d 
> (of class com.sample.SubCategory)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
>   at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
>   at 
> 

[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Description: 
As per latest spark documentation for Java at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 createDataFrame method supports nested JavaBeans. However this is not working. 
Please see below code.

SubCategory class

{code}
public class SubCategory implements Serializable{
private String id;
private String name;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}   
}

{code}

Category class

{code}
public class Category implements Serializable{
private String id;
private SubCategory subCategory;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public SubCategory getSubCategory() {
return subCategory;
}
public void setSubCategory(SubCategory subCategory) {
this.subCategory = subCategory;
}
}
{code}

SparkSample class

{code}
public class SparkSample {
public static void main(String[] args) throws IOException { 

SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local")
.getOrCreate();
//SubCategory
SubCategory sub = new SubCategory();
sub.setId("sc-111");
sub.setName("Sub-1");
//Category
Category category = new Category();
category.setId("s-111");
category.setSubCategory(sub);
//categoryList
List<Category> categoryList = new ArrayList<Category>();
categoryList.add(category);
 //DF
Dataset<Row> dframe = spark.createDataFrame(categoryList, 
Category.class);  
dframe.show();  
}
}
{code}


Above code throws below error.

{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of 
class com.sample.SubCategory)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.sample.SparkSample.main(SparkSample.java:33)
{code}


createDataFrame method throws above exception. But I observed that 
createDataset method works fine with below code.

{code}
Encoder<Category> encoder = Encoders.bean(Category.class); 
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}

  was:
As per latest spark documentation at 

[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Description: 
As per latest spark documentation at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 createDataFrame method supports nested JavaBeans. However this is not working. 
Please see below code.

SubCategory class

{code}
public class SubCategory implements Serializable{
private String id;
private String name;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}   
}

{code}

Category class

{code}
public class Category implements Serializable{
private String id;
private SubCategory subCategory;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public SubCategory getSubCategory() {
return subCategory;
}
public void setSubCategory(SubCategory subCategory) {
this.subCategory = subCategory;
}
}
{code}

SparkSample class

{code}
public class SparkSample {
public static void main(String[] args) throws IOException { 

SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local")
.getOrCreate();
//SubCategory
SubCategory sub = new SubCategory();
sub.setId("sc-111");
sub.setName("Sub-1");
//Category
Category category = new Category();
category.setId("s-111");
category.setSubCategory(sub);
//categoryList
List<Category> categoryList = new ArrayList<Category>();
categoryList.add(category);
 //DF
Dataset<Row> dframe = spark.createDataFrame(categoryList, 
Category.class);  
dframe.show();  
}
}
{code}


Above code throws below error.

{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of 
class com.sample.SubCategory)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.sample.SparkSample.main(SparkSample.java:33)
{code}


createDataFrame method throws above exception. But I observed that 
createDataset method works fine with below code.

{code}
Encoder<Category> encoder = Encoders.bean(Category.class); 
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}

  was:
As per latest spark documentation at 

[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Description: 
As per latest spark documentation at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 createDataFrame method supports nested JavaBeans. However this is not working. 
Please see below code.

SubCategory class

{code}
public class SubCategory implements Serializable{
private String id;
private String name;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}   
}

{code}

Category class

{code}
public class Category implements Serializable{
private String id;
private SubCategory subCategory;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public SubCategory getSubCategory() {
return subCategory;
}
public void setSubCategory(SubCategory subCategory) {
this.subCategory = subCategory;
}
}
{code}

SparkSample 

{code}
public class SparkSample {
public static void main(String[] args) throws IOException { 

SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local")
.getOrCreate();
//SubCategory
SubCategory sub = new SubCategory();
sub.setId("sc-111");
sub.setName("Sub-1");
//Category
Category category = new Category();
category.setId("s-111");
category.setSubCategory(sub);
//categoryList
List<Category> categoryList = new ArrayList<Category>();
categoryList.add(category);
 //DF
Dataset<Row> dframe = spark.createDataFrame(categoryList, 
Category.class);  
dframe.show();  
}
}
{code}


Above code throws below error.

{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of 
class com.sample.SubCategory)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.sample.SparkSample.main(SparkSample.java:33)
{code}


createDataFrame method throws above exception. But I observed that 
createDataset method works fine with below code.

{code}
Encoder<Category> encoder = Encoders.bean(Category.class); 
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}

  was:
As per latest spark documentation at 

[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Description: 
As per latest spark documentation at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 createDataFrame method supports nested JavaBeans. However this is not working. 
Please see below code.

SubCategory class

{code}
public class SubCategory implements Serializable{
private String id;
private String name;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}   
}

{code}

Category class

{code}
public class Category implements Serializable{
private String id;
private SubCategory subCategory;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public SubCategory getSubCategory() {
return subCategory;
}
public void setSubCategory(SubCategory subCategory) {
this.subCategory = subCategory;
}
}
{code}

SparkSample 

{code}
public class SparkSample {
public static void main(String[] args) throws IOException { 

SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local")
.getOrCreate();
//SubCategory
SubCategory sub = new SubCategory();
sub.setId("sc-111");
sub.setName("Sub-1");
//Category
Category category = new Category();
category.setId("s-111");
category.setSubCategory(sub);
//categoryList
List<Category> categoryList = new ArrayList<Category>();
categoryList.add(category);
 //DF
Dataset<Row> dframe = spark.createDataFrame(categoryList, 
Category.class);  
dframe.show();  
}
}
{code}


Above code throws below error.

{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of 
class com.sample.SubCategory)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.sample.SparkSample.main(SparkSample.java:33)
{code}


The {{createDataFrame}} method throws the above exception. But I observed that 
the {{createDataset}} method works fine with the code below.

{code}
Encoder<Category> encoder = Encoders.bean(Category.class); 
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}

  was:
As per latest spark documentation at 

[jira] [Created] (SPARK-17953) Fix typo in SparkSession scaladoc

2016-10-15 Thread Tae Jun Kim (JIRA)
Tae Jun Kim created SPARK-17953:
---

 Summary: Fix typo in SparkSession scaladoc
 Key: SPARK-17953
 URL: https://issues.apache.org/jira/browse/SPARK-17953
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Tae Jun Kim
Priority: Minor


In the SparkSession builder example,
{code}
SparkSession.builder()
 .master("local")
 .appName("Word Count")
 .config("spark.some.config.option", "some-value").
 .getOrCreate()
{code}

should be

{code}
SparkSession.builder()
 .master("local")
 .appName("Word Count")
 .config("spark.some.config.option", "some-value")
 .getOrCreate()
{code}

There is an extra dot at the end of the third line:
{code}
 .config("spark.some.config.option", "some-value").
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean

2016-10-15 Thread Amit Baghel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Baghel updated SPARK-17952:

Description: 
As per latest spark documentation at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 createDataFrame method supports nested JavaBeans. However this is not working. 
Please see below code.

{code}
public class SubCategory implements Serializable{
private String id;
private String name;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}   
}

{code}


{code}
public class Category implements Serializable{
private String id;
private SubCategory subCategory;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public SubCategory getSubCategory() {
return subCategory;
}
public void setSubCategory(SubCategory subCategory) {
this.subCategory = subCategory;
}
}
{code}


{code}
public class SparkSample {
public static void main(String[] args) throws IOException { 

SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local")
.getOrCreate();
//SubCategory
SubCategory sub = new SubCategory();
sub.setId("sc-111");
sub.setName("Sub-1");
//Category
Category category = new Category();
category.setId("s-111");
category.setSubCategory(sub);
//categoryList
List<Category> categoryList = new ArrayList<Category>();
categoryList.add(category);
 //DF
Dataset<Row> dframe = spark.createDataFrame(categoryList, 
Category.class);  
dframe.show();  
}
}
{code}


Above code throws below error.

{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of 
class com.sample.SubCategory)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.sample.SparkSample.main(SparkSample.java:33)
{code}


createDataFrame method throws above error. But I observed that createDataset 
method works fine with below code.

{code}
Encoder<Category> encoder = Encoders.bean(Category.class); 
Dataset<Category> dframe = spark.createDataset(categoryList, encoder);
dframe.show();
{code}

  was:
As per latest spark documentation at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 createDataFrame method supports 

[jira] [Created] (SPARK-17952) Java SparkSession createDataFrame doesn't work with nested Javabean

2016-10-15 Thread Amit Baghel (JIRA)
Amit Baghel created SPARK-17952:
---

 Summary: Java SparkSession createDataFrame doesn't work with 
nested Javabean
 Key: SPARK-17952
 URL: https://issues.apache.org/jira/browse/SPARK-17952
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1, 2.0.0
Reporter: Amit Baghel


As per latest spark documentation at 
http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection,
 createDataFrame method supports nested JavaBeans. However this is not working. 
Please see below code.

{code}
public class SubCategory implements Serializable{
private String id;
private String name;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}   
}

{code}

{code}
public class Category implements Serializable{
private String id;
private SubCategory subCategory;

public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public SubCategory getSubCategory() {
return subCategory;
}
public void setSubCategory(SubCategory subCategory) {
this.subCategory = subCategory;
}
}
{code}

{code}
public class SparkSample {
public static void main(String[] args) throws IOException { 

SparkSession spark = SparkSession
.builder()
.appName("SparkSample")
.master("local")
.getOrCreate();
//SubCategory
SubCategory sub = new SubCategory();
sub.setId("sc-111");
sub.setName("Sub-1");
//Category
Category category = new Category();
category.setId("s-111");
category.setSubCategory(sub);
//categoryList
List<Category> categoryList = new ArrayList<Category>();
categoryList.add(category);
 //DF
Dataset<Row> dframe = spark.createDataFrame(categoryList, 
Category.class);  
dframe.show();  
}
}
{code}

Above code throws below error.

{code}
Exception in thread "main" scala.MatchError: com.sample.SubCategory@e7391d (of 
class com.sample.SubCategory)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:256)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:251)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:403)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1106)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1106)
at 
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1104)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.toStream(Iterator.scala:1322)
at scala.collection.AbstractIterator.toStream(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.toSeq(TraversableOnce.scala:298)
at scala.collection.AbstractIterator.toSeq(Iterator.scala:1336)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:373)
at com.sample.SparkSample.main(SparkSample.java:33)
{code}

createDataFrame method throws above error. But I observed that createDataset 
method works fine with below code.

{code}
Encoder<Category> encoder = Encoders.bean(Category.class); 
Dataset<Category> dframe = spark.createDataset(categoryList, 
encoder);
  

[jira] [Commented] (SPARK-17930) The SerializerInstance instance used when deserializing a TaskResult is not reused

2016-10-15 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577506#comment-15577506
 ] 

Guoqiang Li commented on SPARK-17930:
-

If a stage contains a lot of tasks, e.g. one million tasks, the code here needs 
to create one million SerializerInstance instances, which seriously affects the 
performance of the DAG scheduler. At the very least we could reuse the 
SerializerInstance instance per stage.

> The SerializerInstance instance used when deserializing a TaskResult is not 
> reused 
> ---
>
> Key: SPARK-17930
> URL: https://issues.apache.org/jira/browse/SPARK-17930
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
>Reporter: Guoqiang Li
>
> The following code is called when the DirectTaskResult instance is 
> deserialized
> {noformat}
>   def value(): T = {
> if (valueObjectDeserialized) {
>   valueObject
> } else {
>   // Each deserialization creates a new instance of SerializerInstance, 
> which is very time-consuming
>   val resultSer = SparkEnv.get.serializer.newInstance()
>   valueObject = resultSer.deserialize(valueBytes)
>   valueObjectDeserialized = true
>   valueObject
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17851) Make sure all test sqls in catalyst pass checkAnalysis

2016-10-15 Thread Jiang Xingbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo updated SPARK-17851:
-
Description: Currently we have several tens of test SQL queries in catalyst that 
will fail at `SimpleAnalyzer.checkAnalysis`; we should make sure they are valid.  
(was: Currently we have several tens of test sqls in catalyst will fail at 
`SimpleAnalyzer.checkAnalyze`, we should make sure they are valid.)
Summary: Make sure all test sqls in catalyst pass checkAnalysis  (was: 
Make sure all test sqls in catalyst pass checkAnalyze)

> Make sure all test sqls in catalyst pass checkAnalysis
> --
>
> Key: SPARK-17851
> URL: https://issues.apache.org/jira/browse/SPARK-17851
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> Currently we have several tens of test SQL queries in catalyst that will fail at 
> `SimpleAnalyzer.checkAnalysis`; we should make sure they are valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org