[jira] [Resolved] (SPARK-20980) Rename the option `wholeFile` to `multiLine` for JSON and CSV

2017-06-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20980.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18202
[https://github.com/apache/spark/pull/18202]

> Rename the option `wholeFile` to `multiLine` for JSON and CSV
> -
>
> Key: SPARK-20980
> URL: https://issues.apache.org/jira/browse/SPARK-20980
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> The current option name `wholeFile` is misleading for CSV: it does not mean 
> one record per file, since a single file can contain multiple records. We 
> should therefore rename it; the proposal is `multiLine`.
> To keep things consistent, we need to rename the same option for JSON and fix 
> the related issue in another JIRA.
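
For reference, a minimal sketch of how the renamed option is used from the DataFrame reader (the paths are placeholders and an existing SparkSession named `spark` is assumed; this is an illustration, not code from the ticket):

{code}
// Read JSON where a single record may span multiple lines.
val jsonDF = spark.read
  .option("multiLine", "true")
  .json("/path/to/records.json")

// Read CSV where quoted fields may contain embedded newlines.
val csvDF = spark.read
  .option("multiLine", "true")
  .option("header", "true")
  .csv("/path/to/records.csv")
{code}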



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18016.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18075
[https://github.com/apache/spark/pull/18075]

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(Sim

[jira] [Assigned] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-18016:
---

Assignee: Aleksander Eskilson

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>   at 
> org.codehaus.ja

[jira] [Assigned] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19900:
---

Assignee: Li Yichao

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Assignee: Li Yichao
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
> Fix For: 2.3.0
>
>
> I've found a problem when the node where the driver is running has an unstable 
> network: it is possible for two identical applications to end up running on the 
> cluster at the same time.
> *Steps to Reproduce:*
> # Prepare 3 nodes: one for the Spark master and two for the Spark workers.
> # Submit an application with the parameter spark.driver.supervise = true
> # Go to the node where the driver is running (for example spark-worker-1) and 
> block port 7077
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # Wait more than 60 seconds
> # Look at the Spark master UI
> There are two Spark applications and one driver. The new application is in 
> WAITING state and the second application is in RUNNING state. The driver is in 
> RUNNING or RELAUNCHING state (it depends on the resources available, as I 
> understand it) and it is launched on another node (for example spark-worker-2)
> # Reopen the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # Look at the Spark master UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
> you will see that the old driver is still running!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark

[jira] [Resolved] (SPARK-16251) LocalCheckpointSuite's - missing checkpoint block fails with informative message is flaky.

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16251.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0
   2.0.3

Issue resolved by pull request 18314
[https://github.com/apache/spark/pull/18314]

> LocalCheckpointSuite's - missing checkpoint block fails with informative 
> message is flaky.
> --
>
> Key: SPARK-16251
> URL: https://issues.apache.org/jira/browse/SPARK-16251
> Project: Spark
>  Issue Type: Bug
>Reporter: Prashant Sharma
>Priority: Minor
> Fix For: 2.0.3, 2.2.0, 2.1.2
>
>







[jira] [Resolved] (SPARK-20200) Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20200.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0
   2.0.3

Issue resolved by pull request 18314
[https://github.com/apache/spark/pull/18314]

> Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite
> -
>
> Key: SPARK-20200
> URL: https://issues.apache.org/jira/browse/SPARK-20200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Priority: Minor
>  Labels: flaky-test
> Fix For: 2.0.3, 2.2.0, 2.1.2
>
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/2909/testReport/junit/org.apache.spark.rdd/LocalCheckpointSuite/missing_checkpoint_block_fails_with_informative_message/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_name=missing+checkpoint+block+fails+with+informative+message
> Error Message
> {code}
> Collect should have failed if local checkpoint block is removed...
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: Collect should have failed if 
> local checkpoint block is removed...
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply$mcV$sp(LocalCheckpointSuite.scala:173)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spar

[jira] [Assigned] (SPARK-16251) LocalCheckpointSuite's - missing checkpoint block fails with informative message is flaky.

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-16251:
---

Assignee: Jiang Xingbo

> LocalCheckpointSuite's - missing checkpoint block fails with informative 
> message is flaky.
> --
>
> Key: SPARK-16251
> URL: https://issues.apache.org/jira/browse/SPARK-16251
> Project: Spark
>  Issue Type: Bug
>Reporter: Prashant Sharma
>Assignee: Jiang Xingbo
>Priority: Minor
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>







[jira] [Assigned] (SPARK-20200) Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20200:
---

Assignee: Jiang Xingbo

> Flaky Test: org.apache.spark.rdd.LocalCheckpointSuite
> -
>
> Key: SPARK-20200
> URL: https://issues.apache.org/jira/browse/SPARK-20200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Takuya Ueshin
>Assignee: Jiang Xingbo
>Priority: Minor
>  Labels: flaky-test
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> This test failed recently here:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/2909/testReport/junit/org.apache.spark.rdd/LocalCheckpointSuite/missing_checkpoint_block_fails_with_informative_message/
> Dashboard
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.rdd.LocalCheckpointSuite&test_name=missing+checkpoint+block+fails+with+informative+message
> Error Message
> {code}
> Collect should have failed if local checkpoint block is removed...
> {code}
> {code}
> org.scalatest.exceptions.TestFailedException: Collect should have failed if 
> local checkpoint block is removed...
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply$mcV$sp(LocalCheckpointSuite.scala:173)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite$$anonfun$16.apply(LocalCheckpointSuite.scala:155)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.rdd.LocalCheckpointSuite.runTest(LocalCheckpointSuite.scala:27)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1

[jira] [Resolved] (SPARK-21112) ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21112.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18318
[https://github.com/apache/spark/pull/18318]

> ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT
> --
>
> Key: SPARK-21112
> URL: https://issues.apache.org/jira/browse/SPARK-21112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> {{ALTER TABLE SET TBLPROPERTIES}} should not overwrite the table COMMENT, even 
> when the supplied properties do not include a `COMMENT` entry.
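
A minimal illustration of the expected behavior (hypothetical table name; an existing SparkSession `spark` is assumed):

{code}
spark.sql("CREATE TABLE t (id INT) USING parquet COMMENT 'my table' TBLPROPERTIES ('owner' = 'alice')")
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('owner' = 'bob')")
// Setting unrelated properties must not clear or overwrite the table comment:
// the output below should still show the comment 'my table'.
spark.sql("DESCRIBE TABLE EXTENDED t").show(truncate = false)
{code}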






[jira] [Resolved] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21072.
-
   Resolution: Fixed
 Assignee: coneyliu
Fix Version/s: 2.2.0
   2.1.2

> `TreeNode.mapChildren` should only apply to the children node. 
> ---
>
> Key: SPARK-21072
> URL: https://issues.apache.org/jira/browse/SPARK-21072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: coneyliu
>Assignee: coneyliu
> Fix For: 2.1.2, 2.2.0
>
>
> As the function name and the comments of `TreeNode.mapChildren` indicate, the 
> function should apply only to the current node's children. Therefore, the 
> following code should check whether a field is actually a child node.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
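
A simplified, self-contained sketch of the intended semantics (illustrative names only; this is not the actual Catalyst implementation):

{code}
// mapChildren should transform only the fields that are children of this node;
// non-child fields (here `label`) must be left untouched.
case class Node(label: String, children: Seq[Node]) {
  def mapChildren(f: Node => Node): Node =
    if (children.isEmpty) this
    else copy(children = children.map(f))
}

val tree    = Node("root", Seq(Node("a", Nil), Node("b", Nil)))
val renamed = tree.mapChildren(c => c.copy(label = c.label.toUpperCase))
// renamed.label is still "root"; only the children were transformed.
{code}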






[jira] [Resolved] (SPARK-21114) Test failure in Spark 2.1 due to name mismatch

2017-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21114.
-
   Resolution: Fixed
Fix Version/s: 2.1.2

> Test failure in Spark 2.1 due to name mismatch
> --
>
> Key: SPARK-21114
> URL: https://issues.apache.org/jira/browse/SPARK-21114
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.2
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.1-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/org.apache.spark.sql/SQLQueryTestSuite/arithmetic_sql/






[jira] [Created] (SPARK-21119) unset table properties should keep the table comment

2017-06-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21119:
---

 Summary: unset table properties should keep the table comment
 Key: SPARK-21119
 URL: https://issues.apache.org/jira/browse/SPARK-21119
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
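
A minimal illustration of the intended behavior (hypothetical table name; an existing SparkSession `spark` is assumed):

{code}
spark.sql("CREATE TABLE t2 (id INT) USING parquet COMMENT 'my table' TBLPROPERTIES ('k' = 'v')")
spark.sql("ALTER TABLE t2 UNSET TBLPROPERTIES ('k')")
// Unsetting an unrelated table property should not drop the table comment:
// the output below should still show the comment 'my table'.
spark.sql("DESCRIBE TABLE EXTENDED t2").show(truncate = false)
{code}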









[jira] [Resolved] (SPARK-20994) Alleviate memory pressure in StreamManager

2017-06-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20994.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18231
[https://github.com/apache/spark/pull/18231]

> Alleviate memory pressure in StreamManager
> --
>
> Key: SPARK-20994
> URL: https://issues.apache.org/jira/browse/SPARK-20994
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
> Fix For: 2.3.0
>
>
> In my cluster, we are suffering from OOMs in the shuffle service.
> We found that a lot of executors were fetching blocks from a single shuffle 
> service. Analyzing the memory, we found that the block IDs 
> ({{shuffle_shuffleId_mapId_reduceId}}) take about 1.5 GB.
> In the current code, chunks are fetched from the shuffle service in two steps:
> Step-1. Send {{OpenBlocks}}, which contains the list of blocks to fetch;
> Step-2. Fetch the consecutive chunks from the shuffle service by {{streamId}} 
> and {{chunkIndex}}
> Thus, the memory cost of step-1 can be reduced.
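
A rough back-of-envelope sketch of why the block-ID list handled in Step-1 can get so large (the shuffle dimensions below are assumptions for illustration, not numbers taken from this cluster):

{code}
// Each shuffle block ID is a string of the form "shuffle_<shuffleId>_<mapId>_<reduceId>".
// In Step-1 every reducer sends an OpenBlocks message listing each block it needs from
// this shuffle service, so the service ends up handling on the order of
// (map outputs served) x (reduce tasks) block-ID strings.
val mapOutputsServed = 10000L // assumed
val reduceTasks      = 5000L  // assumed
val bytesPerId       = 40L    // rough per-ID cost: characters plus JVM string overhead
val totalIdBytes     = mapOutputsServed * reduceTasks * bytesPerId
println(f"approximate block-ID footprint: ${totalIdBytes / 1e9}%.1f GB") // ~2.0 GB
{code}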






[jira] [Assigned] (SPARK-20994) Alleviate memory pressure in StreamManager

2017-06-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20994:
---

Assignee: jin xing

> Alleviate memory pressure in StreamManager
> --
>
> Key: SPARK-20994
> URL: https://issues.apache.org/jira/browse/SPARK-20994
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>Assignee: jin xing
> Fix For: 2.3.0
>
>
> In my cluster, we are suffering from OOMs in the shuffle service.
> We found that a lot of executors were fetching blocks from a single shuffle 
> service. Analyzing the memory, we found that the block IDs 
> ({{shuffle_shuffleId_mapId_reduceId}}) take about 1.5 GB.
> In the current code, chunks are fetched from the shuffle service in two steps:
> Step-1. Send {{OpenBlocks}}, which contains the list of blocks to fetch;
> Step-2. Fetch the consecutive chunks from the shuffle service by {{streamId}} 
> and {{chunkIndex}}
> Thus, the memory cost of step-1 can be reduced.






[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-16 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052057#comment-16052057
 ] 

Wenchen Fan commented on SPARK-18016:
-

cc [~aeskilson] do you wanna send a new PR to backport it?

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler

[jira] [Resolved] (SPARK-21090) Optimize the unified memory manager code

2017-06-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21090.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18296
[https://github.com/apache/spark/pull/18296]

> Optimize the unified  memory manager code
> -
>
> Key: SPARK-21090
> URL: https://issues.apache.org/jira/browse/SPARK-21090
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
> Fix For: 2.3.0
>
>
> 1. In *acquireStorageMemory*, when the MemoryMode is OFF_HEAP, *maxMemory* 
> should be replaced with *maxOffHeapStorageMemory*.
> 2. When borrowing memory from execution, requesting *numBytes - 
> storagePool.memoryFree* instead of *numBytes* would be more reasonable.
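
A minimal, self-contained sketch of the point in item 1 (field names assumed for illustration; this is not the actual UnifiedMemoryManager source):

{code}
object MemoryBoundSketch {
  sealed trait MemoryMode
  case object OnHeap  extends MemoryMode
  case object OffHeap extends MemoryMode

  val maxOnHeapStorageMemory: Long  = 4L * 1024 * 1024 * 1024 // example value
  val maxOffHeapStorageMemory: Long = 2L * 1024 * 1024 * 1024 // example value

  // The bound that acquireStorageMemory checks a request against should depend on
  // the memory mode: for OFF_HEAP it must be the off-heap storage maximum, not the
  // overall (on-heap) maximum.
  def maxStorageBound(mode: MemoryMode): Long = mode match {
    case OnHeap  => maxOnHeapStorageMemory
    case OffHeap => maxOffHeapStorageMemory
  }
}
{code}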






[jira] [Assigned] (SPARK-21090) Optimize the unified memory manager code

2017-06-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21090:
---

Assignee: liuxian

> Optimize the unified  memory manager code
> -
>
> Key: SPARK-21090
> URL: https://issues.apache.org/jira/browse/SPARK-21090
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
> Fix For: 2.2.0
>
>
> 1. In *acquireStorageMemory*, when the MemoryMode is OFF_HEAP, *maxMemory* 
> should be replaced with *maxOffHeapStorageMemory*.
> 2. When borrowing memory from execution, requesting *numBytes - 
> storagePool.memoryFree* instead of *numBytes* would be more reasonable.






[jira] [Updated] (SPARK-21090) Optimize the unified memory manager code

2017-06-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21090:

Fix Version/s: (was: 2.3.0)
   2.2.0

> Optimize the unified  memory manager code
> -
>
> Key: SPARK-21090
> URL: https://issues.apache.org/jira/browse/SPARK-21090
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
> Fix For: 2.2.0
>
>
> 1. In *acquireStorageMemory*, when the MemoryMode is OFF_HEAP, *maxMemory* 
> should be replaced with *maxOffHeapStorageMemory*.
> 2. When borrowing memory from execution, requesting *numBytes - 
> storagePool.memoryFree* instead of *numBytes* would be more reasonable.






[jira] [Resolved] (SPARK-21132) DISTINCT modifier of function arguments should not be silently ignored

2017-06-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21132.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18340
[https://github.com/apache/spark/pull/18340]

> DISTINCT modifier of function arguments should not be silently ignored
> --
>
> Key: SPARK-21132
> URL: https://issues.apache.org/jira/browse/SPARK-21132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> The DISTINCT modifier of function arguments should not be silently ignored 
> when it is not supported.
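
A hypothetical illustration of the reported behavior (the choice of function is an assumption, not taken from the ticket; an existing SparkSession `spark` is assumed):

{code}
// `hex` is a scalar (non-aggregate) function, so a DISTINCT modifier on its argument
// is not supported; per this ticket such a query should raise an error instead of
// having the DISTINCT keyword silently dropped.
spark.range(10).createOrReplaceTempView("t_ids")
spark.sql("SELECT hex(DISTINCT id) FROM t_ids").show()
{code}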






[jira] [Resolved] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21133.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18343
[https://github.com/apache/spark/pull/18343]

> HighlyCompressedMapStatus#writeExternal throws NPE
> --
>
> Key: SPARK-21133
> URL: https://issues.apache.org/jira/browse/SPARK-21133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Priority: Blocker
> Fix For: 2.2.0
>
>
> To reproduce, set {{spark.sql.shuffle.partitions}} to a value greater than 2000 
> for a query with a shuffle, for example:
> {code:sql}
> spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
>   set spark.sql.shuffle.partitions=2001;
>   drop table if exists spark_hcms_npe;
>   create table spark_hcms_npe as select id, count(*) from big_table group by 
> id;
> "
> {code}
> Error logs:
> {noformat}
> 17/06/18 15:00:27 ERROR Utils: Exception encountered
> java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Assigned] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21133:
---

Assignee: Yuming Wang

> HighlyCompressedMapStatus#writeExternal throws NPE
> --
>
> Key: SPARK-21133
> URL: https://issues.apache.org/jira/browse/SPARK-21133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Blocker
> Fix For: 2.2.0
>
>
> To reproduce, set {{spark.sql.shuffle.partitions}} to a value greater than 2000 
> for a query with a shuffle, for example:
> {code:sql}
> spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
>   set spark.sql.shuffle.partitions=2001;
>   drop table if exists spark_hcms_npe;
>   create table spark_hcms_npe as select id, count(*) from big_table group by 
> id;
> "
> {code}
> Error logs:
> {noformat}
> 17/06/18 15:00:27 ERROR Utils: Exception encountered
> java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecut

[jira] [Resolved] (SPARK-20989) Fail to start multiple workers on one host if external shuffle service is enabled in standalone mode

2017-06-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20989.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18290
[https://github.com/apache/spark/pull/18290]

> Fail to start multiple workers on one host if external shuffle service is 
> enabled in standalone mode
> 
>
> Key: SPARK-20989
> URL: https://issues.apache.org/jira/browse/SPARK-20989
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.1
>Reporter: Jiang Xingbo
>Priority: Minor
> Fix For: 2.3.0
>
>
> In standalone mode, if we enable the external shuffle service by setting 
> `spark.shuffle.service.enabled` to true and then try to start multiple workers 
> on one host (by setting `SPARK_WORKER_INSTANCES=3` in spark-env.sh and then 
> running `sbin/start-slaves.sh`), only one worker per host launches successfully 
> and the rest fail to launch.
> The reason is that the external shuffle service port is configured by 
> `spark.shuffle.service.port`, so currently we can start no more than one 
> external shuffle service on each host. In our case, each worker tries to start 
> an external shuffle service, and only one of them succeeds.
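
For reference, a minimal sketch of the configuration described above (example values; the property names are the ones mentioned in the report, and 7337 is the documented default port):

{code}
# conf/spark-env.sh -- run several workers per host
export SPARK_WORKER_INSTANCES=3

# conf/spark-defaults.conf -- enable the external shuffle service
spark.shuffle.service.enabled  true
spark.shuffle.service.port     7337
{code}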






[jira] [Assigned] (SPARK-20989) Fail to start multiple workers on one host if external shuffle service is enabled in standalone mode

2017-06-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20989:
---

Assignee: Jiang Xingbo

> Fail to start multiple workers on one host if external shuffle service is 
> enabled in standalone mode
> 
>
> Key: SPARK-20989
> URL: https://issues.apache.org/jira/browse/SPARK-20989
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.1
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Minor
> Fix For: 2.3.0
>
>
> In standalone mode, if we enable the external shuffle service by setting 
> `spark.shuffle.service.enabled` to true and then try to start multiple workers 
> on one host (by setting `SPARK_WORKER_INSTANCES=3` in spark-env.sh and then 
> running `sbin/start-slaves.sh`), only one worker per host launches successfully 
> and the rest fail to launch.
> The reason is that the external shuffle service port is configured by 
> `spark.shuffle.service.port`, so currently we can start no more than one 
> external shuffle service on each host. In our case, each worker tries to start 
> an external shuffle service, and only one of them succeeds.






[jira] [Resolved] (SPARK-20640) Make rpc timeout and retry for shuffle registration configurable

2017-06-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20640.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18092
[https://github.com/apache/spark/pull/18092]

> Make rpc timeout and retry for shuffle registration configurable
> 
>
> Key: SPARK-20640
> URL: https://issues.apache.org/jira/browse/SPARK-20640
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.2
>Reporter: Sital Kedia
> Fix For: 2.3.0
>
>
> Currently the shuffle service registration timeout and retry count are 
> hardcoded (see 
> https://github.com/sitalkedia/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java#L144
>  and 
> https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L197).
>  This works well for small workloads, but under heavy load, when the shuffle 
> service is busy transferring large amounts of data, we see significant delays 
> in responding to the registration request. As a result, executors often fail to 
> register with the shuffle service, eventually failing the job. We need to make 
> these two parameters configurable.
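
A minimal sketch of the general pattern, reading these values from SparkConf instead of hardcoding them (the configuration key names and defaults below are assumptions for illustration, not necessarily the ones introduced by the fix):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Hypothetical keys -- the point is only that the timeout and the retry count
// become tunable instead of being compiled-in constants.
val registrationTimeoutMs  = conf.getTimeAsMs("spark.shuffle.registration.timeout", "5000ms")
val registrationMaxRetries = conf.getInt("spark.shuffle.registration.maxAttempts", 3)
{code}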






[jira] [Assigned] (SPARK-20640) Make rpc timeout and retry for shuffle registration configurable

2017-06-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20640:
---

Assignee: Li Yichao

> Make rpc timeout and retry for shuffle registration configurable
> 
>
> Key: SPARK-20640
> URL: https://issues.apache.org/jira/browse/SPARK-20640
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.2
>Reporter: Sital Kedia
>Assignee: Li Yichao
> Fix For: 2.3.0
>
>
> Currently the shuffle service registration timeout and retry count are 
> hardcoded (see 
> https://github.com/sitalkedia/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java#L144
>  and 
> https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L197).
>  This works well for small workloads, but under heavy workload, when the 
> shuffle service is busy transferring large amounts of data, it responds to 
> registration requests with significant delay; as a result we often see 
> executors fail to register with the shuffle service, eventually failing the 
> job. We need to make these two parameters configurable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21163) DataFrame.toPandas should respect the data type

2017-06-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21163:
---

 Summary: DataFrame.toPandas should respect the data type
 Key: SPARK-21163
 URL: https://issues.apache.org/jira/browse/SPARK-21163
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18016:

Fix Version/s: (was: 2.3.0)
   2.2.0
   2.1.2

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.1.2, 2.2.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(Si

[jira] [Resolved] (SPARK-21163) DataFrame.toPandas should respect the data type

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21163.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18378
[https://github.com/apache/spark/pull/18378]

> DataFrame.toPandas should respect the data type
> ---
>
> Key: SPARK-21163
> URL: https://issues.apache.org/jira/browse/SPARK-21163
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20832.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18362
[https://github.com/apache/spark/pull/18362]

> Standalone master should explicitly inform drivers of worker deaths and 
> invalidate external shuffle service outputs
> ---
>
> Key: SPARK-20832
> URL: https://issues.apache.org/jira/browse/SPARK-20832
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
> Fix For: 2.3.0
>
>
> In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added 
> logic to the DAGScheduler to mark external shuffle service instances as 
> unavailable upon task failure when the task failure reason was "SlaveLost" 
> and this was known to be caused by worker death. If the Spark Master 
> discovered that a worker was dead then it would notify any drivers with 
> executors on those workers to mark those executors as dead. The linked patch 
> simply piggybacked on this logic to have the executor death notification also 
> imply worker death and to have worker-death-caused-executor-death imply 
> shuffle file loss.
> However, there are modes of external shuffle service loss which this 
> mechanism does not detect, leaving the system prone to race conditions. Consider 
> the following:
> * Spark standalone is configured to run an external shuffle service embedded 
> in the Worker.
> * Application has shuffle outputs and executors on Worker A.
> * Stage depending on outputs of tasks that ran on Worker A starts.
> * All executors on Worker A are removed due to dying with exceptions or 
> scaling down via the dynamic allocation APIs, but _not_ due to worker death. 
> Worker A is still healthy at this point.
> * At this point the MapOutputTracker still records map output locations on 
> Worker A's shuffle service. This is expected behavior. 
> * Worker A dies at an instant where the application has no executors running 
> on it.
> * The Master knows that Worker A died but does not inform the driver (which 
> had no executors on that worker at the time of its death).
> * Some task from the running stage attempts to fetch map outputs from Worker 
> A but these requests time out because Worker A's shuffle service isn't 
> available.
> * Due to other logic in the scheduler, these preventable FetchFailures don't 
> wind up invalidating the now-unavailable map output locations (this is 
> a distinct bug / behavior which I'll discuss in a separate JIRA ticket).
> * This behavior leads to several unsuccessful stage reattempts and ultimately 
> to a job failure.
> A simple way to address this would be to have the Master explicitly notify 
> drivers of all Worker deaths, even if those drivers don't currently have 
> executors. The Spark Standalone scheduler backend can receive the explicit 
> WorkerLost message and can bubble up the right calls to the task scheduler 
> and DAGScheduler to invalidate map output locations from the now-dead 
> external shuffle service.
> This relates to SPARK-20115 in the sense that both tickets aim to address 
> issues where the external shuffle service is unavailable. The key difference 
> is the mechanism for detection: SPARK-20115 marks the external shuffle 
> service as unavailable whenever any fetch failure occurs from it, whereas the 
> proposal here relies on more explicit signals. This JIRA ticket's proposal is 
> scoped only to Spark Standalone mode. As a compromise, we might be able to 
> consider "all of a single shuffle's outputs lost on a single external shuffle 
> service" following a fetch failure (to be discussed in separate JIRA). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20832:
---

Assignee: Jiang Xingbo

> Standalone master should explicitly inform drivers of worker deaths and 
> invalidate external shuffle service outputs
> ---
>
> Key: SPARK-20832
> URL: https://issues.apache.org/jira/browse/SPARK-20832
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Jiang Xingbo
> Fix For: 2.3.0
>
>
> In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added 
> logic to the DAGScheduler to mark external shuffle service instances as 
> unavailable upon task failure when the task failure reason was "SlaveLost" 
> and this was known to be caused by worker death. If the Spark Master 
> discovered that a worker was dead then it would notify any drivers with 
> executors on those workers to mark those executors as dead. The linked patch 
> simply piggybacked on this logic to have the executor death notification also 
> imply worker death and to have worker-death-caused-executor-death imply 
> shuffle file loss.
> However, there are modes of external shuffle service loss which this 
> mechanism does not detect, leaving the system prone to race conditions. Consider 
> the following:
> * Spark standalone is configured to run an external shuffle service embedded 
> in the Worker.
> * Application has shuffle outputs and executors on Worker A.
> * Stage depending on outputs of tasks that ran on Worker A starts.
> * All executors on Worker A are removed due to dying with exceptions or 
> scaling down via the dynamic allocation APIs, but _not_ due to worker death. 
> Worker A is still healthy at this point.
> * At this point the MapOutputTracker still records map output locations on 
> Worker A's shuffle service. This is expected behavior. 
> * Worker A dies at an instant where the application has no executors running 
> on it.
> * The Master knows that Worker A died but does not inform the driver (which 
> had no executors on that worker at the time of its death).
> * Some task from the running stage attempts to fetch map outputs from Worker 
> A but these requests time out because Worker A's shuffle service isn't 
> available.
> * Due to other logic in the scheduler, these preventable FetchFailures don't 
> wind up invalidating the now-unavailable map output locations (this is 
> a distinct bug / behavior which I'll discuss in a separate JIRA ticket).
> * This behavior leads to several unsuccessful stage reattempts and ultimately 
> to a job failure.
> A simple way to address this would be to have the Master explicitly notify 
> drivers of all Worker deaths, even if those drivers don't currently have 
> executors. The Spark Standalone scheduler backend can receive the explicit 
> WorkerLost message and can bubble up the right calls to the task scheduler 
> and DAGScheduler to invalidate map output locations from the now-dead 
> external shuffle service.
> This relates to SPARK-20115 in the sense that both tickets aim to address 
> issues where the external shuffle service is unavailable. The key difference 
> is the mechanism for detection: SPARK-20115 marks the external shuffle 
> service as unavailable whenever any fetch failure occurs from it, whereas the 
> proposal here relies on more explicit signals. This JIRA ticket's proposal is 
> scoped only to Spark Standalone mode. As a compromise, we might be able to 
> consider "all of a single shuffle's outputs lost on a single external shuffle 
> service" following a fetch failure (to be discussed in separate JIRA). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-13534.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 15821
[https://github.com/apache/spark/pull/15821]

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wes McKinney
> Fix For: 2.3.0
>
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-13534:
---

Assignee: Bryan Cutler

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20923:
---

Assignee: Thomas Graves

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  The number of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses was already removed from the task end event under 
> SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
> know if I missed something?
> If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20923.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  The number of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses was already removed from the task end event under 
> SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
> know if I missed something?
> If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20923:

Labels: releasenotes  (was: )

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  The number of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses was already removed from the task end event under 
> SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
> know if I missed something?
> If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-06-22 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060283#comment-16060283
 ] 

Wenchen Fan commented on SPARK-20923:
-

This patch changes public behavior, so we should mention it in the release 
notes. Basically, users can track the status of updated blocks via the 
{{SparkListenerTaskEnd}} event, but this feature was originally introduced for 
internal usage, and I'm wondering how many users actually rely on it. 
After this patch we no longer track it by default; users can still turn it 
on by setting {{spark.taskMetrics.trackUpdatedBlockStatuses}} to true.
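For illustration, a minimal way an application could opt back in and watch these statuses from a listener, assuming the configuration key quoted above and the public listener API:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("updated-block-status-demo")
  .set("spark.taskMetrics.trackUpdatedBlockStatuses", "true")  // off by default after this patch
val sc = new SparkContext(conf)

sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // One (BlockId, BlockStatus) entry per updated block; this is the collection
    // whose size blew up the driver heap in the case reported above.
    val statuses = taskEnd.taskMetrics.updatedBlockStatuses
    println(s"task ${taskEnd.taskInfo.taskId} updated ${statuses.size} blocks")
  }
})

sc.range(0, 1000).cache().count()   // caching produces updated block statuses
{code}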

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  The number of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses was already removed from the task end event under 
> SPARK-20084.  I don't see anything else that seems to be using this.  Anybody 
> know if I missed something?
> If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing it so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21174) Validate sampling fraction in logical operator level

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21174.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18387
[https://github.com/apache/spark/pull/18387]

> Validate sampling fraction in logical operator level
> 
>
> Key: SPARK-21174
> URL: https://issues.apache.org/jira/browse/SPARK-21174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently the validation of the sampling fraction in Dataset is incomplete.
> As an improvement, validate the sampling ratio at the logical operator level:
> 1) if with replacement: the ratio should be nonnegative
> 2) else: the ratio should be in the interval [0, 1]
> Also add test cases for the validation.
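A quick illustration of the two rules at the Dataset API level; the expected outcomes in the comments follow the rules above rather than anything taken from the patch:

{code}
val ds = spark.range(100)

// 1) With replacement the fraction only needs to be nonnegative, so > 1.0 is fine.
ds.sample(withReplacement = true, fraction = 1.5, seed = 42L).count()

// 2) Without replacement the fraction must lie in [0, 1]; a negative value like
//    this one is what the logical-operator-level validation should reject.
ds.sample(withReplacement = false, fraction = -0.1, seed = 42L).count()
{code}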



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21174) Validate sampling fraction in logical operator level

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21174:
---

Assignee: Gengliang Wang

> Validate sampling fraction in logical operator level
> 
>
> Key: SPARK-21174
> URL: https://issues.apache.org/jira/browse/SPARK-21174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently the validation of the sampling fraction in Dataset is incomplete.
> As an improvement, validate the sampling ratio at the logical operator level:
> 1) if with replacement: the ratio should be nonnegative
> 2) else: the ratio should be in the interval [0, 1]
> Also add test cases for the validation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21047) Add test suites for complicated cases in ColumnarBatchSuite

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21047:
---

Assignee: jin xing

> Add test suites for complicated cases in ColumnarBatchSuite
> ---
>
> Key: SPARK-21047
> URL: https://issues.apache.org/jira/browse/SPARK-21047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: jin xing
> Fix For: 2.3.0
>
>
> Currently {{ColumnarBatchSuite}} has only very simple test cases for arrays. This 
> JIRA will add test suites for complicated cases such as nested arrays in 
> {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21047) Add test suites for complicated cases in ColumnarBatchSuite

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21047.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18327
[https://github.com/apache/spark/pull/18327]

> Add test suites for complicated cases in ColumnarBatchSuite
> ---
>
> Key: SPARK-21047
> URL: https://issues.apache.org/jira/browse/SPARK-21047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: jin xing
> Fix For: 2.3.0
>
>
> Currently {{ColumnarBatchSuite}} has only very simple test cases for arrays. This 
> JIRA will add test suites for complicated cases such as nested arrays in 
> {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21165.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18386
[https://github.com/apache/spark/pull/18386]

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.2.0
>
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
> localhost, executor driver): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> , tree: first#74
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
> at 
> or

[jira] [Assigned] (SPARK-21115) If the cores left is less than the coresPerExecutor,the cores left will not be allocated, so it should not to check in every schedule

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21115:
---

Assignee: eaton

> If the cores left is less than the coresPerExecutor,the cores left will not 
> be allocated, so it should not to check in every schedule
> -
>
> Key: SPARK-21115
> URL: https://issues.apache.org/jira/browse/SPARK-21115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: eaton
>Assignee: eaton
>Priority: Minor
> Fix For: 2.3.0
>
>
> If we start an app with the param --total-executor-cores=4 and 
> spark.executor.cores=3, the number of cores left is always 1, so the master 
> will keep trying to allocate executors in the function 
> org.apache.spark.deploy.master.startExecutorsOnWorkers in every schedule.
> Another question is whether it would be better to allocate another executor 
> with 1 core for the leftover cores.
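The arithmetic behind the report, as a tiny sketch; the variable names are illustrative, and the real logic lives in Master.startExecutorsOnWorkers:

{code}
// With --total-executor-cores=4 and spark.executor.cores=3 the master can place
// exactly one 3-core executor; the leftover core can never form a full executor,
// yet it keeps the app on the scheduling path every round.
val totalExecutorCores = 4
val coresPerExecutor   = 3

val executorsLaunched = totalExecutorCores / coresPerExecutor   // 1
val coresLeftOver     = totalExecutorCores % coresPerExecutor   // 1

assert(coresLeftOver < coresPerExecutor)  // so no further executor can ever be launched
{code}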



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21115) If the cores left is less than the coresPerExecutor,the cores left will not be allocated, so it should not to check in every schedule

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21115.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18322
[https://github.com/apache/spark/pull/18322]

> If the cores left is less than the coresPerExecutor,the cores left will not 
> be allocated, so it should not to check in every schedule
> -
>
> Key: SPARK-21115
> URL: https://issues.apache.org/jira/browse/SPARK-21115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: eaton
>Priority: Minor
> Fix For: 2.3.0
>
>
> If we start an app with the param --total-executor-cores=4 and 
> spark.executor.cores=3, the number of cores left is always 1, so the master 
> will keep trying to allocate executors in the function 
> org.apache.spark.deploy.master.startExecutorsOnWorkers in every schedule.
> Another question is whether it would be better to allocate another executor 
> with 1 core for the leftover cores.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21193) Specify Pandas version in setup.py

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21193:
---

Assignee: Hyukjin Kwon

> Specify Pandas version in setup.py
> --
>
> Key: SPARK-21193
> URL: https://issues.apache.org/jira/browse/SPARK-21193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> It looks like we don't specify a Pandas version in 
> https://github.com/apache/spark/blob/master/python/setup.py#L202. It looks 
> like a few versions no longer work with Spark. It might be better to set 
> this explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21193) Specify Pandas version in setup.py

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21193.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18403
[https://github.com/apache/spark/pull/18403]

> Specify Pandas version in setup.py
> --
>
> Key: SPARK-21193
> URL: https://issues.apache.org/jira/browse/SPARK-21193
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> It looks like we don't specify a Pandas version in 
> https://github.com/apache/spark/blob/master/python/setup.py#L202. It looks 
> like a few versions no longer work with Spark. It might be better to set 
> this explicitly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21159:
---

Assignee: Marcelo Vanzin

> Cluster mode, driver throws connection refused exception submitted by 
> SparkLauncher
> ---
>
> Key: SPARK-21159
> URL: https://issues.apache.org/jira/browse/SPARK-21159
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Server A-Master
> Server B-Slave
>Reporter: niefei
>Assignee: Marcelo Vanzin
>
> When a Spark application is submitted via the SparkLauncher#startApplication 
> method, the caller gets a SparkAppHandle. In the test environment, the 
> launcher runs on Server A; in client mode everything is OK. In cluster mode, 
> the launcher runs on Server A and the driver runs on Server B. In this 
> scenario, when initializing the SparkContext, a LauncherBackend tries to 
> connect back to the launcher application via the specified port and IP 
> address. The problem is that the LauncherBackend implementation connects to 
> the loopback IP, 127.0.0.1, which causes a connection-refused error because 
> Server B never ran the launcher.
> The expected behavior is that the LauncherBackend should use Server A's IP 
> address to connect for reporting the running status.
> Below is the stacktrace:
> 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext.
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at java.net.Socket.<init>(Socket.java:434)
>   at java.net.Socket.<init>(Socket.java:244)
>   at 
> org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25)
>   at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61)
>   at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at 
> http://172.25.108.62:4040
> 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors
> 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking 
> each executor to shut down
> 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.s

[jira] [Resolved] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21159.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18397
[https://github.com/apache/spark/pull/18397]

> Cluster mode, driver throws connection refused exception submitted by 
> SparkLauncher
> ---
>
> Key: SPARK-21159
> URL: https://issues.apache.org/jira/browse/SPARK-21159
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Server A-Master
> Server B-Slave
>Reporter: niefei
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> When a Spark application is submitted via the SparkLauncher#startApplication 
> method, the caller gets a SparkAppHandle. In the test environment, the 
> launcher runs on Server A; in client mode everything is OK. In cluster mode, 
> the launcher runs on Server A and the driver runs on Server B. In this 
> scenario, when initializing the SparkContext, a LauncherBackend tries to 
> connect back to the launcher application via the specified port and IP 
> address. The problem is that the LauncherBackend implementation connects to 
> the loopback IP, 127.0.0.1, which causes a connection-refused error because 
> Server B never ran the launcher.
> The expected behavior is that the LauncherBackend should use Server A's IP 
> address to connect for reporting the running status.
> Below is the stacktrace:
> 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext.
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at java.net.Socket.<init>(Socket.java:434)
>   at java.net.Socket.<init>(Socket.java:244)
>   at 
> org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25)
>   at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61)
>   at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at 
> http://172.25.108.62:4040
> 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors
> 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking 
> each executor to shut down
> 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588)
>   at 
> org.apache.spark.Spar

[jira] [Updated] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher

2017-06-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21159:

Fix Version/s: (was: 2.3.0)
   2.2.0
   2.1.2

> Cluster mode, driver throws connection refused exception submitted by 
> SparkLauncher
> ---
>
> Key: SPARK-21159
> URL: https://issues.apache.org/jira/browse/SPARK-21159
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Server A-Master
> Server B-Slave
>Reporter: niefei
>Assignee: Marcelo Vanzin
> Fix For: 2.1.2, 2.2.0
>
>
> When a Spark application is submitted via the SparkLauncher#startApplication 
> method, the caller gets a SparkAppHandle. In the test environment, the 
> launcher runs on Server A; in client mode everything is OK. In cluster mode, 
> the launcher runs on Server A and the driver runs on Server B. In this 
> scenario, when initializing the SparkContext, a LauncherBackend tries to 
> connect back to the launcher application via the specified port and IP 
> address. The problem is that the LauncherBackend implementation connects to 
> the loopback IP, 127.0.0.1, which causes a connection-refused error because 
> Server B never ran the launcher.
> The expected behavior is that the LauncherBackend should use Server A's IP 
> address to connect for reporting the running status.
> Below is the stacktrace:
> 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext.
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at java.net.Socket.<init>(Socket.java:434)
>   at java.net.Socket.<init>(Socket.java:244)
>   at 
> org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25)
>   at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61)
>   at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at 
> http://172.25.108.62:4040
> 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors
> 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking 
> each executor to shut down
> 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(Spar

[jira] [Resolved] (SPARK-21203) Wrong results of insertion of Array of Struct

2017-06-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21203.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.2

> Wrong results of insertion of Array of Struct
> -
>
> Key: SPARK-21203
> URL: https://issues.apache.org/jira/browse/SPARK-21203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.1.2, 2.2.0
>
>
> {noformat}
>   spark.sql(
> """
>   |CREATE TABLE `tab1`
>   |(`custom_fields` ARRAY<STRUCT<`id`: INT, `value`: STRING>>)
>   |USING parquet
> """.stripMargin)
>   spark.sql(
> """
>   |INSERT INTO `tab1`
>   |SELECT ARRAY(named_struct('id', 1, 'value', 'a'), 
> named_struct('id', 2, 'value', 'b'))
> """.stripMargin)
>   spark.sql("SELECT custom_fields.id, custom_fields.value FROM 
> tab1").show()
> {noformat}
> The returned result is wrong:
> {noformat}
> Row(Array(2, 2), Array("b", "b"))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18016:

Fix Version/s: (was: 2.1.2)

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>   at 
> org.codehaus.janino.

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-24 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1606#comment-1606
 ] 

Wenchen Fan commented on SPARK-18016:
-

ok reverted

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)

[jira] [Assigned] (SPARK-21196) Split codegen info of query plan into sequence

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21196:
---

Assignee: Gengliang Wang

> Split codegen info of query plan into sequence
> --
>
> Key: SPARK-21196
> URL: https://issues.apache.org/jira/browse/SPARK-21196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Codegen info of a query plan can be very long. 
> In the debugging console / web page, it would be more readable if the subtrees 
> and their corresponding codegen were split into a sequence. 
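
For context, this is the output the issue refers to. A minimal way to produce it, assuming Spark 2.x's existing {{debugCodegen()}} helper (the example query itself is made up):

{code}
import org.apache.spark.sql.execution.debug._

val df = spark.range(100).selectExpr("id", "id * 2 AS doubled").filter("doubled > 10")

// Prints every whole-stage-codegen subtree followed by its generated Java source.
// For wide or deeply nested plans this single blob becomes very long, which is
// the readability problem described above.
df.debugCodegen()
{code}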



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21196) Split codegen info of query plan into sequence

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21196.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18409
[https://github.com/apache/spark/pull/18409]

> Split codegen info of query plan into sequence
> --
>
> Key: SPARK-21196
> URL: https://issues.apache.org/jira/browse/SPARK-21196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Codegen info of a query plan can be very long. 
> In the debugging console / web page, it would be more readable if the subtrees 
> and their corresponding codegen were split into a sequence. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21229:
---

 Summary: remove QueryPlan.preCanonicalized
 Key: SPARK-21229
 URL: https://issues.apache.org/jira/browse/SPARK-21229
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19104.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18418
[https://github.com/apache/spark/pull/18418]

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
> Fix For: 2.2.0
>
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyE

[jira] [Assigned] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19104:
---

Assignee: Liang-Chi Hsieh

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
>   at org.codehaus.janino.SimpleCompil

[jira] [Resolved] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21155.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18369
[https://github.com/apache/spark/pull/18369]

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of running 
> tasks.  See the screen shot attachments for what this looks like.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21155:
---

Assignee: Eric Vandenberg

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Assignee: Eric Vandenberg
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of running 
> tasks.  See the screen shot attachments for what this looks like.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21238) allow nested SQL execution

2017-06-28 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21238:
---

 Summary: allow nested SQL execution
 Key: SPARK-21238
 URL: https://issues.apache.org/jira/browse/SPARK-21238
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21222:
---

Assignee: Gengliang Wang

> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Move elimination of Distinct clause from analyzer to optimizer
> The Distinct clause is useless inside a MAX/MIN clause. For example,
> "SELECT MAX(DISTINCT a) FROM src"
> is equivalent to
> "SELECT MAX(a) FROM src"
> However, this optimization is currently implemented in the analyzer. It should 
> be in the optimizer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21222.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18429
[https://github.com/apache/spark/pull/18429]

> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Move elimination of Distinct clause from analyzer to optimizer
> The Distinct clause is useless inside a MAX/MIN clause. For example,
> "SELECT MAX(DISTINCT a) FROM src"
> is equivalent to
> "SELECT MAX(a) FROM src"
> However, this optimization is currently implemented in the analyzer. It should 
> be in the optimizer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21229.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18440
[https://github.com/apache/spark/pull/18440]

> remove QueryPlan.preCanonicalized
> -
>
> Key: SPARK-21229
> URL: https://issues.apache.org/jira/browse/SPARK-21229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21237) Invalidate stats once table data is changed

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21237:
---

Assignee: Zhenhua Wang

> Invalidate stats once table data is changed
> ---
>
> Key: SPARK-21237
> URL: https://issues.apache.org/jira/browse/SPARK-21237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21237) Invalidate stats once table data is changed

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21237.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18449
[https://github.com/apache/spark/pull/18449]

> Invalidate stats once table data is changed
> ---
>
> Key: SPARK-21237
> URL: https://issues.apache.org/jira/browse/SPARK-21237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3577) Add task metric to report spill time

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-3577.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17471
[https://github.com/apache/spark/pull/17471]

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: spill_size.jpg
>
>
> The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
> {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
> We should probably add task metrics to report this spill time, since for 
> shuffles, this would have previously been reported as part of shuffle write 
> time (with the original hash-based sorter).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3577) Add task metric to report spill time

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-3577:
--

Assignee: Sital Kedia

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Assignee: Sital Kedia
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: spill_size.jpg
>
>
> The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
> {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
> We should probably add task metrics to report this spill time, since for 
> shuffles, this would have previously been reported as part of shuffle write 
> time (with the original hash-based sorter).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21238) allow nested SQL execution

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21238.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18450
[https://github.com/apache/spark/pull/18450]

> allow nested SQL execution
> --
>
> Key: SPARK-21238
> URL: https://issues.apache.org/jira/browse/SPARK-21238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21225.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18435
[https://github.com/apache/spark/pull/18435]

> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store the 
> tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> But I think this code only considers the situation of one task per core. 
> If the user configures "spark.task.cpus" as 2 or 3, it really doesn't need that 
> much space. I think it can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}
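
To make the sizing difference concrete (a hypothetical illustration of the proposal above, with made-up numbers):

{code}
// An offer with 8 cores and spark.task.cpus = 2 can run at most 4 tasks,
// so a 4-slot buffer is enough.
val cores = 8
val cpusPerTask = 2                                                // spark.task.cpus
val currentCapacity = cores                                        // 8 slots pre-allocated today
val proposedCapacity = math.ceil(cores * 1.0 / cpusPerTask).toInt  // 4 slots
{code}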



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21225:
---

Assignee: yangZhiguo

> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Assignee: yangZhiguo
>Priority: Minor
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store the 
> tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> But I think this code only considers the situation of one task per core. 
> If the user configures "spark.task.cpus" as 2 or 3, it really doesn't need that 
> much space. I think it can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21052) Add hash map metrics to join

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21052:
---

Assignee: Liang-Chi Hsieh

> Add hash map metrics to join
> 
>
> Key: SPARK-21052
> URL: https://issues.apache.org/jira/browse/SPARK-21052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> We should add the avg hash map probe metric to the join operator and report it on the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21052) Add hash map metrics to join

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21052.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18301
[https://github.com/apache/spark/pull/18301]

> Add hash map metrics to join
> 
>
> Key: SPARK-21052
> URL: https://issues.apache.org/jira/browse/SPARK-21052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> We should add the avg hash map probe metric to the join operator and report it on the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21253) Cannot fetch big blocks to disk

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21253.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18472
[https://github.com/apache/spark/pull/18472]

> Cannot fetch big blocks to disk 
> 
>
> Key: SPARK-21253
> URL: https://issues.apache.org/jira/browse/SPARK-21253
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
> Attachments: ui-thread-dump-jqhadoop221-154.gif
>
>
> This reproduces on a Spark *cluster*, but not in *local* mode:
> 1. Start a spark context with {{spark.reducer.maxReqSizeShuffleToMem=1K}}:
> {code:actionscript}
> $ spark-shell --conf spark.reducer.maxReqSizeShuffleToMem=1K
> {code}
> 2. A shuffle:
> {code:actionscript}
> scala> sc.parallelize(0 until 300, 10).repartition(2001).count()
> {code}
> The error messages:
> {noformat}
> org.apache.spark.shuffle.FetchFailedException: Failed to send request for 
> 1649611690367_2 to yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: 
> java.io.IOException: Connection reset by peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
> at scala.collection.AbstractIterator.to(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to send request for 1649611690367_2 to 
> yhd-jqhadoop166.int.yihaodian.com/10.17.28.166:7337: java.io.IOException: 
> Connection reset by peer
> at 
> org.apache.spark.network.client.TransportClient.lambda$stream$1(TransportClient.java:196)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at 
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at 
> io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:163)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:93)
> at 
> io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:28)
> at 
> org.apache.spark.network.client.TransportClient.stream(Tra

[jira] [Resolved] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21176.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 18437
[https://github.com/apache/spark/pull/18437]

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
> Fix For: 2.2.0, 2.1.2
>
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7 * 44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.
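
A minimal sketch of the stop-gap the reporter describes, assuming Jetty's standard {{QueuedThreadPool}} API (the actual JettyUtils.scala patch and the eventual fix may differ):

{code}
import org.eclipse.jetty.util.thread.QueuedThreadPool

// Enlarge the Jetty thread pool beyond the default 200 threads so that the
// per-executor reverse-proxy selector threads (roughly numExecutors * numCpus / 2)
// cannot exhaust it. 400 is the value used in the reporter's local patch.
val pool = new QueuedThreadPool(400)
pool.setDaemon(true)
{code}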



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21176:
---

Assignee: Ingo Schuster

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>Assignee: Ingo Schuster
>  Labels: network, web-ui
> Fix For: 2.1.2, 2.2.0
>
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7 * 44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21258) Window result incorrect using complex object with spilling

2017-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21258.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 18470
[https://github.com/apache/spark/pull/18470]

> Window result incorrect using complex object with spilling
> --
>
> Key: SPARK-21258
> URL: https://issues.apache.org/jira/browse/SPARK-21258
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0, 2.1.2
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17528) data should be copied properly before saving into InternalRow

2017-06-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17528:

Summary: data should be copied properly before saving into InternalRow  
(was: MutableProjection should not cache content from the input row)

> data should be copied properly before saving into InternalRow
> -
>
> Key: SPARK-17528
> URL: https://issues.apache.org/jira/browse/SPARK-17528
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-30 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069665#comment-16069665
 ] 

Wenchen Fan commented on SPARK-21190:
-

For aggregate, I think it makes more sense to do the grouping on the Spark side, 
then send each group to Python and use Pandas to do the reduce or add extra 
columns. We can have a config for the maximum group size, and when sending groups 
from the JVM to Python, fail the query if that maximum is exceeded.

For window, like [~leif] said, ideally we should not send each window to Python, 
to avoid transferring a lot of duplicate data. Instead, we can send the entire 
partition to Python in a streaming manner, then do the windowing on the Python 
side and use Pandas to do the reduce or something. The downside is that we need 
to re-implement many things in Python.

I'll think about the API too.

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> Two things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input["c"] = input["a"] + input["b"]
>   input["d"] = input["a"] - input["b"]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input["a"] + input["b"]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same im

[jira] [Resolved] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2017-06-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18294.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18438
[https://github.com/apache/spark/pull/18438]

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
> Fix For: 2.3.0
>
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we should 
> implement a `HadoopMapRedCommitProtocol` that supports the older `mapred` 
> package's committer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2017-06-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-18294:
---

Assignee: Jiang Xingbo

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
> Fix For: 2.3.0
>
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we should 
> implement a `HadoopMapRedCommitProtocol` that supports the older `mapred` 
> package's committer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17924) Consolidate streaming and batch write path

2017-06-30 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070019#comment-16070019
 ] 

Wenchen Fan commented on SPARK-17924:
-

cc [~rxin] can we resolve this ticket? all subtasks are done.

> Consolidate streaming and batch write path
> --
>
> Key: SPARK-17924
> URL: https://issues.apache.org/jira/browse/SPARK-17924
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Structured streaming and normal SQL operations currently have two separate 
> write paths, leading to a lot of duplicated (similar-looking) functions and if 
> branches. The purpose of this ticket is to consolidate the two as much as 
> possible to make the write path clearer.
> A side-effect of this change is that streaming will automatically support all 
> the file formats.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17528) data should be copied properly before saving into InternalRow

2017-06-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17528.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18483
[https://github.com/apache/spark/pull/18483]

> data should be copied properly before saving into InternalRow
> -
>
> Key: SPARK-17528
> URL: https://issues.apache.org/jira/browse/SPARK-17528
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21271) UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8

2017-06-30 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070944#comment-16070944
 ] 

Wenchen Fan commented on SPARK-21271:
-

We do have this requirement for the var-length part in UnsafeRow, because 
var-length data is always word-aligned. This is also why we store the length 
information; otherwise we could just calculate the length by subtracting two 
adjacent offsets.

> UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8
> ---
>
> Key: SPARK-21271
> URL: https://issues.apache.org/jira/browse/SPARK-21271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> The method is:
> {code}
> public int hashCode() {
>   return Murmur3_x86_32.hashUnsafeWords(baseObject, baseOffset, sizeInBytes, 42);
> }
> {code}
> but sizeInBytes is not always a multiple of 8 (in which case hashUnsafeWords 
> throws an assertion error) - for example here: 
> {code}FixedLengthRowBasedKeyValueBatch.appendRow{code}
> The fix could be to use hashUnsafeBytes, or to use hashUnsafeWords but on a 
> prefix that is a multiple of 8.
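To make the two suggested directions concrete, here is a small, hypothetical sketch using Spark's `Murmur3_x86_32` and `Platform` helpers; it only illustrates the options described above and is not the fix that was eventually committed.

{code}
import org.apache.spark.unsafe.Platform;
import org.apache.spark.unsafe.hash.Murmur3_x86_32;

public class HashCodeOptions {
  // Option 1: hash the raw bytes, which has no multiple-of-8 requirement.
  static int hashAllBytes(byte[] base, int sizeInBytes) {
    return Murmur3_x86_32.hashUnsafeBytes(base, Platform.BYTE_ARRAY_OFFSET, sizeInBytes, 42);
  }

  // Option 2: keep hashUnsafeWords, but only over the 8-byte-aligned prefix.
  static int hashAlignedPrefix(byte[] base, int sizeInBytes) {
    int alignedSize = sizeInBytes - (sizeInBytes % 8);
    return Murmur3_x86_32.hashUnsafeWords(base, Platform.BYTE_ARRAY_OFFSET, alignedSize, 42);
  }
}
{code}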



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21127) Update statistics after data changing commands

2017-06-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21127:
---

Assignee: Zhenhua Wang

> Update statistics after data changing commands
> --
>
> Key: SPARK-21127
> URL: https://issues.apache.org/jira/browse/SPARK-21127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21127) Update statistics after data changing commands

2017-06-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21127.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18334
[https://github.com/apache/spark/pull/18334]

> Update statistics after data changing commands
> --
>
> Key: SPARK-21127
> URL: https://issues.apache.org/jira/browse/SPARK-21127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21271) UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8

2017-06-30 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070963#comment-16070963
 ] 

Wenchen Fan commented on SPARK-21271:
-

By word-aligned I mean 8-byte aligned, so the size of var-length data is always 
a multiple of 8.
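As a small illustration of the 8-byte alignment being described (assuming Spark's `ByteArrayMethods` helper; the byte counts are made up for the example):

{code}
import org.apache.spark.unsafe.array.ByteArrayMethods;

public class WordAlignment {
  public static void main(String[] args) {
    for (int numBytes : new int[] {1, 7, 8, 13}) {
      // Rounds a byte count up to the next multiple of 8, i.e. ((n + 7) / 8) * 8.
      int aligned = ByteArrayMethods.roundNumberOfBytesToNearestWord(numBytes);
      System.out.println(numBytes + " -> " + aligned);  // prints 1 -> 8, 7 -> 8, 8 -> 8, 13 -> 16
    }
  }
}
{code}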

> UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8
> ---
>
> Key: SPARK-21271
> URL: https://issues.apache.org/jira/browse/SPARK-21271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> The method is:
> {code}
> public int hashCode() {
>   return Murmur3_x86_32.hashUnsafeWords(baseObject, baseOffset, sizeInBytes, 42);
> }
> {code}
> but sizeInBytes is not always a multiple of 8 (in which case hashUnsafeWords 
> throws an assertion error) - for example here: 
> {code}FixedLengthRowBasedKeyValueBatch.appendRow{code}
> The fix could be to use hashUnsafeBytes, or to use hashUnsafeWords but on a 
> prefix that is a multiple of 8.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21271) UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8

2017-06-30 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070969#comment-16070969
 ] 

Wenchen Fan commented on SPARK-21271:
-

Yea we should. BTW the code seems wrong to me; the length of the value row 
should be {{vlen}} instead of {{vlen + 4}}.

> UnsafeRow.hashCode assertion when sizeInBytes not multiple of 8
> ---
>
> Key: SPARK-21271
> URL: https://issues.apache.org/jira/browse/SPARK-21271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> The method is:
> {code}
> public int hashCode() {
>   return Murmur3_x86_32.hashUnsafeWords(baseObject, baseOffset, sizeInBytes, 42);
> }
> {code}
> but sizeInBytes is not always a multiple of 8 (in which case hashUnsafeWords 
> throws an assertion error) - for example here: 
> {code}FixedLengthRowBasedKeyValueBatch.appendRow{code}
> The fix could be to use hashUnsafeBytes, or to use hashUnsafeWords but on a 
> prefix that is a multiple of 8.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-30 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071002#comment-16071002
 ] 

Wenchen Fan commented on SPARK-21190:
-

Thanks for your proposal!

I have 2 thoughts:
1. How should we handle null values? IIRC, {{pd.Series}} doesn't work well with 
null, e.g. an int series with null becomes float type, and a boolean series with 
null becomes object type (see the small example below).
2. I think a UDF returning a scalar value only makes sense for group/window, 
otherwise the result is non-deterministic for users.
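A small illustration of the null concern in point 1 (behavior of the pandas versions current at the time of writing; the values are made up):

{code}
import pandas as pd

ints = pd.Series([1, 2, None])
print(ints.dtype)    # float64 -- an integer series is silently widened once a null appears

flags = pd.Series([True, False, None])
print(flags.dtype)   # object -- a boolean series loses its bool dtype
{code}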

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> Two things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-21282) Fix test failure in 2.0

2017-07-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21282.
-
   Resolution: Fixed
Fix Version/s: 2.0.3

Issue resolved by pull request 18506
[https://github.com/apache/spark/pull/18506]

> Fix test failure in 2.0
> ---
>
> Key: SPARK-21282
> URL: https://issues.apache.org/jira/browse/SPARK-21282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.3
>
>
> There is a test failure after backporting a fix from 2.2 to 2.0, because the 
> automatically generated column names are different. 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-maven-hadoop-2.2/lastCompletedBuild/testReport/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page

2017-07-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21250.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18464
[https://github.com/apache/spark/pull/18464]

> Add a url in the table of 'Running Executors'  in worker page to visit job 
> page
> ---
>
> Key: SPARK-21250
> URL: https://issues.apache.org/jira/browse/SPARK-21250
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Fix For: 2.3.0
>
>
> Add a URL to the 'Running Executors' table on the worker page that links to the 
> job page.
> When I click the 'Name' URL, the current page jumps to the job page. Of course 
> this applies only to the 'Running Executors' table.
> This 'Name' URL does not exist in the 'Finished Executors' table, so clicking 
> there will not jump to any page.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21250) Add a url in the table of 'Running Executors' in worker page to visit job page

2017-07-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21250:
---

Assignee: guoxiaolongzte

> Add a url in the table of 'Running Executors'  in worker page to visit job 
> page
> ---
>
> Key: SPARK-21250
> URL: https://issues.apache.org/jira/browse/SPARK-21250
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Minor
> Fix For: 2.3.0
>
>
> Add a URL to the 'Running Executors' table on the worker page that links to the 
> job page.
> When I click the 'Name' URL, the current page jumps to the job page. Of course 
> this applies only to the 'Running Executors' table.
> This 'Name' URL does not exist in the 'Finished Executors' table, so clicking 
> there will not jump to any page.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-07-03 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072019#comment-16072019
 ] 

Wenchen Fan commented on SPARK-21190:
-

> I think we can get away with doing windowing (deciding which rows to 
> aggregate) in scala to avoid reimplementing that logic in python, and then 
> send row indexes (a list of begin/end pairs which each are the bounds of a 
> window) over to python to select the windows.

Does this mean we need to keep the data of the whole window in both JVM and 
Python? This seems too expensive...

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> Two things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-21284) rename SessionCatalog.registerFunction parameter name

2017-07-03 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21284:
---

 Summary: rename SessionCatalog.registerFunction parameter name
 Key: SPARK-21284
 URL: https://issues.apache.org/jira/browse/SPARK-21284
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21137) Spark reads many small files slowly off local filesystem

2017-07-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21137.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18441
[https://github.com/apache/spark/pull/18441]

> Spark reads many small files slowly off local filesystem
> 
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
> Fix For: 2.3.0
>
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark, one will hit many issues.  
> Firstly, even if the data is small (each file only, say, 1K), any job can take 
> a very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks; I doubt it will ever finish).
> It seems all the code in Spark that manages file listing is single-threaded 
> and not well optimised.  When I hand-crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read the Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output the line "1,227,645 input paths to process"; it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so it creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!
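As a hedged workaround sketch for the symptom described above (not the change made by the pull request): list the files outside Spark and hand the reader an explicit list of paths, so the driver skips its own slow directory scan. The root path, parallelism, and the `spark` session name are illustrative assumptions.

{code}
import os
from concurrent.futures import ThreadPoolExecutor

root = "/data/enron"  # hypothetical dataset root with one directory per mailbox
subdirs = [os.path.join(root, d) for d in os.listdir(root)]

def list_dir(d):
    # Return the full paths of all entries directly under one directory.
    return [os.path.join(d, f) for f in os.listdir(d)]

with ThreadPoolExecutor(max_workers=32) as pool:
    files = [path for batch in pool.map(list_dir, subdirs) for path in batch]

# Passing an explicit list of files avoids Spark's own listing of the root path.
df = spark.read.text(files)  # assumes an existing SparkSession bound to `spark`
{code}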



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21137) Spark reads many small files slowly off local filesystem

2017-07-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21137:
---

Assignee: Sean Owen

> Spark reads many small files slowly off local filesystem
> 
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark, one will hit many issues.  
> Firstly, even if the data is small (each file only, say, 1K), any job can take 
> a very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks; I doubt it will ever finish).
> It seems all the code in Spark that manages file listing is single-threaded 
> and not well optimised.  When I hand-crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read the Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output the line "1,227,645 input paths to process"; it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so it creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21283) FileOutputStream should be created as append mode

2017-07-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21283:
---

Assignee: liuxian

> FileOutputStream should be created as append mode
> -
>
> Key: SPARK-21283
> URL: https://issues.apache.org/jira/browse/SPARK-21283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.3.0
>
>
> `FileAppender` is used to write the `stderr` and `stdout` files in 
> `ExecutorRunner`.
> The header information is written into the `stderr` file before the 
> `ErrorStream` is appended to it; if the `FileOutputStream` is not created in 
> append mode, the header information will be lost.
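A tiny, self-contained illustration of the difference (the file name and messages are made up; this is not the ExecutorRunner code itself):

{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class AppendModeExample {
  public static void main(String[] args) throws IOException {
    File stderrFile = new File("stderr");

    // Header written first, as the launcher does before any process output arrives.
    try (FileOutputStream header = new FileOutputStream(stderrFile)) {
      header.write("Spark executor launch command ...\n".getBytes(StandardCharsets.UTF_8));
    }

    // new FileOutputStream(file) would truncate the file and drop the header;
    // the second constructor argument `true` opens it in append mode instead.
    try (FileOutputStream out = new FileOutputStream(stderrFile, true)) {
      out.write("process output ...\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
{code}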



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21283) FileOutputStream should be created as append mode

2017-07-03 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21283.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18507
[https://github.com/apache/spark/pull/18507]

> FileOutputStream should be created as append mode
> -
>
> Key: SPARK-21283
> URL: https://issues.apache.org/jira/browse/SPARK-21283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
> Fix For: 2.3.0
>
>
> `FileAppender` is used to write the `stderr` and `stdout` files in 
> `ExecutorRunner`.
> The header information is written into the `stderr` file before the 
> `ErrorStream` is appended to it; if the `FileOutputStream` is not created in 
> append mode, the header information will be lost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-07-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19507.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18521
[https://github.com/apache/spark/pull/18521]

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type <type 'int'>
> {quote}
> Passing and printing a field name would make debugging easier.
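A minimal sketch of that idea (a hypothetical `verify_with_path` helper, not the code merged in the pull request above): thread a name/path through the recursion so the error message points at the offending value.

{code}
from pyspark.sql.types import ArrayType, FloatType, MapType, StructType

def verify_with_path(obj, data_type, name="obj"):
    """Like _verify_type, but the TypeError message carries the path to the bad value."""
    if isinstance(data_type, StructType):
        for f in data_type.fields:
            verify_with_path(obj.get(f.name), f.dataType, "%s.%s" % (name, f.name))
    elif isinstance(data_type, MapType):
        for k, v in obj.items():
            verify_with_path(v, data_type.valueType, "%s[%r]" % (name, k))
    elif isinstance(data_type, ArrayType):
        for i, e in enumerate(obj):
            verify_with_path(e, data_type.elementType, "%s[%d]" % (name, i))
    elif isinstance(data_type, FloatType) and not isinstance(obj, float):
        # Only FloatType is checked here, to keep the sketch short.
        raise TypeError("%s: FloatType can not accept object %r in type %s"
                        % (name, obj, type(obj)))

# verify_with_path({'nest1': {'nest2': [1]}}, schema) now fails with a message like
# "obj.nest1['nest2'][0]: FloatType can not accept object 1 in type <class 'int'>"
{code}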



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21296) Avoid per-record type dispatch in PySpark createDataFrame schema verification

2017-07-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21296.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18521
[https://github.com/apache/spark/pull/18521]

> Avoid per-record type dispatch in PySpark createDataFrame schema verification
> -
>
> Key: SPARK-21296
> URL: https://issues.apache.org/jira/browse/SPARK-21296
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> Here, we do the type dispatch per record - 
> https://github.com/apache/spark/blob/d935e0a9d9bb3d3c74e9529e161648caa50696b7/python/pyspark/sql/types.py#L1252-L1366
>  and
> https://github.com/apache/spark/blob/d935e0a9d9bb3d3c74e9529e161648caa50696b7/python/pyspark/sql/session.py#L517-L537
> This requires per-record operation cost. We should get rid of this.
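A minimal sketch of the idea (illustrative only, not the merged change): dispatch on the data type once to build a verifier closure, then reuse that closure for every record.

{code}
from pyspark.sql.types import LongType, StringType, StructType

def make_verifier(data_type):
    """Build a per-record verifier; the type dispatch happens once, here."""
    if isinstance(data_type, StructType):
        field_verifiers = [(f.name, make_verifier(f.dataType)) for f in data_type.fields]
        def verify_struct(obj):
            for name, verify in field_verifiers:
                verify(obj.get(name))
        return verify_struct
    if isinstance(data_type, LongType):
        def verify_long(obj):
            if obj is not None and not isinstance(obj, int):
                raise TypeError("LongType can not accept %r" % (obj,))
        return verify_long
    if isinstance(data_type, StringType):
        def verify_string(obj):
            if obj is not None and not isinstance(obj, str):
                raise TypeError("StringType can not accept %r" % (obj,))
        return verify_string
    return lambda obj: None  # other types are skipped in this sketch

schema = StructType().add("id", LongType()).add("name", StringType())
verify = make_verifier(schema)                      # built once per schema
for record in [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]:
    verify(record)                                  # per-record work is just the closure call
{code}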



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21296) Avoid per-record type dispatch in PySpark createDataFrame schema verification

2017-07-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21296:
---

Assignee: Hyukjin Kwon

> Avoid per-record type dispatch in PySpark createDataFrame schema verification
> -
>
> Key: SPARK-21296
> URL: https://issues.apache.org/jira/browse/SPARK-21296
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> Here, we do the type dispatch per record - 
> https://github.com/apache/spark/blob/d935e0a9d9bb3d3c74e9529e161648caa50696b7/python/pyspark/sql/types.py#L1252-L1366
>  and
> https://github.com/apache/spark/blob/d935e0a9d9bb3d3c74e9529e161648caa50696b7/python/pyspark/sql/session.py#L517-L537
> This requires per-record operation cost. We should get rid of this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-07-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19507:
---

Assignee: Hyukjin Kwon

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type <type 'int'>
> {quote}
> Passing and printing a field name would make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


