[jira] [Assigned] (SPARK-16813) Remove private[sql] and private[spark] from catalyst package
[ https://issues.apache.org/jira/browse/SPARK-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16813:

    Assignee: Reynold Xin (was: Apache Spark)

> Remove private[sql] and private[spark] from catalyst package
>
> Key: SPARK-16813
> URL: https://issues.apache.org/jira/browse/SPARK-16813
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> The catalyst package is meant to be internal, and as a result it does not
> make sense to mark things as private[sql] or private[spark]. It simply makes
> debugging harder when Spark developers need to inspect the plans at runtime.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16813) Remove private[sql] and private[spark] from catalyst package
[ https://issues.apache.org/jira/browse/SPARK-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400468#comment-15400468 ]

Apache Spark commented on SPARK-16813:

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14418
[jira] [Assigned] (SPARK-16813) Remove private[sql] and private[spark] from catalyst package
[ https://issues.apache.org/jira/browse/SPARK-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16813:

    Assignee: Apache Spark (was: Reynold Xin)
[jira] [Created] (SPARK-16813) Remove private[sql] and private[spark] from catalyst package
Reynold Xin created SPARK-16813:

    Summary: Remove private[sql] and private[spark] from catalyst package
    Key: SPARK-16813
    URL: https://issues.apache.org/jira/browse/SPARK-16813
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Reynold Xin

The catalyst package is meant to be internal, and as a result it does not
make sense to mark things as private[sql] or private[spark]. It simply makes
debugging harder when Spark developers need to inspect the plans at runtime.
[jira] [Assigned] (SPARK-16812) Open up SparkILoop.getAddedJars
[ https://issues.apache.org/jira/browse/SPARK-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16812:

    Assignee: Reynold Xin (was: Apache Spark)

> Open up SparkILoop.getAddedJars
>
> Key: SPARK-16812
> URL: https://issues.apache.org/jira/browse/SPARK-16812
> Project: Spark
> Issue Type: Improvement
> Components: Spark Shell
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> SparkILoop.getAddedJars is a useful method to use so we can programmatically
> get the list of jars added.
[jira] [Assigned] (SPARK-16812) Open up SparkILoop.getAddedJars
[ https://issues.apache.org/jira/browse/SPARK-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16812:

    Assignee: Apache Spark (was: Reynold Xin)
[jira] [Commented] (SPARK-16812) Open up SparkILoop.getAddedJars
[ https://issues.apache.org/jira/browse/SPARK-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400442#comment-15400442 ]

Apache Spark commented on SPARK-16812:

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14417
[jira] [Created] (SPARK-16812) Open up SparkILoop.getAddedJars
Reynold Xin created SPARK-16812:

    Summary: Open up SparkILoop.getAddedJars
    Key: SPARK-16812
    URL: https://issues.apache.org/jira/browse/SPARK-16812
    Project: Spark
    Issue Type: Improvement
    Components: Spark Shell
    Reporter: Reynold Xin
    Assignee: Reynold Xin

SparkILoop.getAddedJars is a useful method to use so we can programmatically
get the list of jars added.
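For context, "programmatically get the list of jars added" amounts to reading a comma-separated jar list such as the spark.jars property. A minimal sketch of that idea in Python; the helper name and behavior here are illustrative assumptions, not Spark's actual SparkILoop implementation (which is Scala):

```python
def get_added_jars(jars_property):
    """Split a comma-separated jar list (e.g. the spark.jars property).

    Hypothetical stand-in for the idea behind SparkILoop.getAddedJars;
    not the actual Spark code.
    """
    if not jars_property:
        return []
    # Drop empty entries left by stray commas and trim whitespace.
    return [jar.strip() for jar in jars_property.split(",") if jar.strip()]
```

For example, `get_added_jars("a.jar, b.jar,")` yields `["a.jar", "b.jar"]`, and a missing property yields an empty list.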
[jira] [Assigned] (SPARK-16776) Fix Kafka deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16776:

    Assignee: Apache Spark

> Fix Kafka deprecation warnings
>
> Key: SPARK-16776
> URL: https://issues.apache.org/jira/browse/SPARK-16776
> Project: Spark
> Issue Type: Sub-task
> Components: Streaming
> Reporter: holdenk
> Assignee: Apache Spark
>
> The new KafkaTestUtils depends on many deprecated APIs - see if we can
> refactor this to be avoided.
[jira] [Assigned] (SPARK-16776) Fix Kafka deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16776:

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-16776) Fix Kafka deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400439#comment-15400439 ]

Apache Spark commented on SPARK-16776:

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14416
[jira] [Updated] (SPARK-16732) Remove unused codes in subexpressionEliminationForWholeStageCodegen
[ https://issues.apache.org/jira/browse/SPARK-16732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16732:

    Assignee: yucai

> Remove unused codes in subexpressionEliminationForWholeStageCodegen
>
> Key: SPARK-16732
> URL: https://issues.apache.org/jira/browse/SPARK-16732
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: yucai
> Assignee: yucai
> Fix For: 2.0.1, 2.1.0
>
> Some codes in subexpressionEliminationForWholeStageCodegen are never used
> actually. Remove them using this jira.
[jira] [Resolved] (SPARK-16732) Remove unused codes in subexpressionEliminationForWholeStageCodegen
[ https://issues.apache.org/jira/browse/SPARK-16732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-16732.

    Resolution: Fixed
    Fix Version/s: 2.1.0, 2.0.1
[jira] [Commented] (SPARK-16776) Fix Kafka deprecation warnings
[ https://issues.apache.org/jira/browse/SPARK-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400433#comment-15400433 ]

Hyukjin Kwon commented on SPARK-16776:

Oh, I meant to just leave the warnings, but let me do this quickly. Thanks for asking this!
[jira] [Resolved] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-16748.

    Resolution: Fixed
    Fix Version/s: 2.1.0, 2.0.1

Issue resolved by pull request 14395
[https://github.com/apache/spark/pull/14395]

> Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY
> clause
>
> Key: SPARK-16748
> URL: https://issues.apache.org/jira/browse/SPARK-16748
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Yin Huai
> Assignee: Tathagata Das
> Fix For: 2.0.1, 2.1.0
>
> {code}
> import org.apache.spark.sql.functions._
> val myUDF = udf((c: String) => s"""${c.take(5)}""")
> spark.sql("SELECT cast(null as string) as a").select(myUDF($"a").as("b")).orderBy($"b").collect
> {code}
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange rangepartitioning(b#345 ASC, 200)
> +- *Project [UDF(null) AS b#345]
>    +- Scan OneRowRelation[]
>   at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272)
>   at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
>   at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
> {code}
[jira] [Created] (SPARK-16811) spark streaming-Exception in thread “submit-job-thread-pool-40”
BinXu created SPARK-16811:

    Summary: spark streaming-Exception in thread "submit-job-thread-pool-40"
    Key: SPARK-16811
    URL: https://issues.apache.org/jira/browse/SPARK-16811
    Project: Spark
    Issue Type: Question
    Components: Streaming
    Affects Versions: 1.6.1
    Environment: spark 1.6.1
    Reporter: BinXu
    Priority: Blocker

{color:red}I am using Spark 1.6. When I run the streaming application for a few minutes, it causes an exception like this:{color}

{code}
Exception in thread "submit-job-thread-pool-40"
Exception in thread "submit-job-thread-pool-33"
Exception in thread "submit-job-thread-pool-23"
Exception in thread "submit-job-thread-pool-14"
Exception in thread "submit-job-thread-pool-29"
Exception in thread "submit-job-thread-pool-39"
Exception in thread "submit-job-thread-pool-2"
java.lang.Error: java.lang.InterruptedException
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:502)
	at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
	at org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165)
	at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	... 2 more
{code}

{color:red}This problem is so weird; the consume speed is enough. My batch time is 20s, and for the first few minutes (app is ok) the process time is 8s. But after that, the
[jira] [Resolved] (SPARK-16797) Repartiton call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Jeffrey resolved SPARK-16797.

    Resolution: Fixed

Resolved in Spark 2.0.0

> Repartiton call w/ 0 partitions drops data
>
> Key: SPARK-16797
> URL: https://issues.apache.org/jira/browse/SPARK-16797
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.2
> Reporter: Bryan Jeffrey
> Priority: Minor
> Labels: easyfix
> Fix For: 2.0.0
>
> When you call RDD.repartition(0) or DStream.repartition(0), the input data
> is silently dropped. This should not silently fail; instead an exception
> should be thrown to alert the user to the issue.
[jira] [Updated] (SPARK-16797) Repartiton call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Jeffrey updated SPARK-16797:

    Affects Version/s: (was: 2.0.0)
    Fix Version/s: 2.0.0
[jira] [Commented] (SPARK-16797) Repartiton call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400362#comment-15400362 ]

Bryan Jeffrey commented on SPARK-16797:

I just ran the same thing on Spark 2.0.0 as released, and the issue does appear to be resolved. I'm sorry about the confusion - I had run this earlier on a 2.0 preview & thought I was seeing the same issue. We can close this as fixed in 2.0.0. Thank you!
[jira] [Comment Edited] (SPARK-16797) Repartiton call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400356#comment-15400356 ]

Bryan Jeffrey edited comment on SPARK-16797 at 7/30/16 1:37 AM:

Hello. Here is a simple example using Spark 1.6.1:

{code}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SimpleExample {
  def main(args: Array[String]): Unit = {
    val appName = "Simple Example"
    val sparkConf = new SparkConf().setAppName(appName)
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(5))

    val input = Array(1,2,3,4,5,6,7,8)
    val queue = scala.collection.mutable.Queue(ssc.sparkContext.parallelize(input))
    val data: InputDStream[Int] = ssc.queueStream(queue)
    data.foreachRDD(x => println("Initial Count: " + x.count()))

    val streamPartition = data.repartition(0)
    streamPartition.foreachRDD(x => println("Stream Repartition: " + x.count()))

    val rddPartition = data.transform(x => x.repartition(0))
    rddPartition.foreachRDD(x => println("Rdd Repartition: " + x.count()))

    val singlePartitionStreamPartition = data.repartition(1)
    singlePartitionStreamPartition.foreachRDD(x => println("Stream w/ Single Partition: " + x.count()))

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}

Output:

{code}
Initial Count: 8
Stream Repartition: 0
Rdd Repartition: 0
Stream w/ Single Partition: 8
... More streaming output (but obviously zero counts as we're using an InputDStream w/ no other data) ...
{code}
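The fix the issue asks for is easy to sketch outside Spark: validate the partition count before doing any work, so a zero count fails loudly instead of returning empty partitions. A toy round-robin partitioner stands in for Spark's shuffle; `repartition_checked` is a hypothetical helper for illustration, not a Spark API:

```python
def repartition_checked(data, num_partitions):
    """Toy round-robin repartition that fails fast on a non-positive
    partition count instead of silently dropping data (the behavior
    requested in SPARK-16797)."""
    if num_partitions <= 0:
        raise ValueError(
            "Number of partitions (%d) must be positive" % num_partitions)
    # Assign elements to buckets round-robin, like a hash-free shuffle.
    buckets = [[] for _ in range(num_partitions)]
    for i, item in enumerate(data):
        buckets[i % num_partitions].append(item)
    return buckets
```

With this guard, `repartition_checked(range(8), 0)` raises a ValueError immediately, mirroring the "Stream Repartition: 0" rows above that would otherwise go unnoticed.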
[jira] [Comment Edited] (SPARK-16797) Repartiton call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400356#comment-15400356 ]

Bryan Jeffrey edited comment on SPARK-16797 at 7/30/16 1:36 AM.
[jira] [Commented] (SPARK-16797) Repartiton call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400356#comment-15400356 ]

Bryan Jeffrey commented on SPARK-16797:

Hello. Here is a simple example using Spark 1.6.1:
[jira] [Updated] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16802:
    Affects Version/s: (was: 2.0.1)

> joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
> --------------------------------------------------------------------
>
> Key: SPARK-16802
> URL: https://issues.apache.org/jira/browse/SPARK-16802
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Sylvain Zimmer
> Priority: Critical
>
> Hello!
> This is a little similar to [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I have reopened it?).
> I would recommend giving another full review to {{HashedRelation.scala}}, particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors that I haven't managed to reproduce so far, as well as what I suspect could be memory leaks (I have a query in a loop OOMing after a few iterations despite not caching its results).
> Here is the script to reproduce the ArrayIndexOutOfBoundsException on the current 2.0 branch:
> {code}
> import os
> import random
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
>
> sc = SparkContext()
> sqlc = SQLContext(sc)
> schema1 = SparkTypes.StructType([
>     SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True)
> ])
> schema2 = SparkTypes.StructType([
>     SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True)
> ])
>
> def randlong():
>     return random.randint(-9223372036854775808, 9223372036854775807)
>
> while True:
>     l1, l2 = randlong(), randlong()
>     # Sample values that crash:
>     # l1, l2 = 4661454128115150227, -5543241376386463808
>     print "Testing with %s, %s" % (l1, l2)
>     data1 = [(l1, ), (l2, )]
>     data2 = [(l1, )]
>     df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1)
>     df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2)
>     crash = True
>     if crash:
>         os.system("rm -rf /tmp/sparkbug")
>         df1.write.parquet("/tmp/sparkbug/vertex")
>         df2.write.parquet("/tmp/sparkbug/edge")
>     df1 = sqlc.read.load("/tmp/sparkbug/vertex")
>     df2 = sqlc.read.load("/tmp/sparkbug/edge")
>     sqlc.registerDataFrameAsTable(df1, "df1")
>     sqlc.registerDataFrameAsTable(df2, "df2")
>     result_df = sqlc.sql("""
>         SELECT df1.id1
>         FROM df1
>         LEFT OUTER JOIN df2 ON df1.id1 = df2.id2
>     """)
>     print result_df.collect()
> {code}
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 1728150825
>     at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463)
>     at org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>     at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>     at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>     at
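The class of bug behind an ArrayIndexOutOfBoundsException in a long-keyed hash map can be sketched in plain Python. This is a toy model only, not Spark's actual {{LongToUnsafeRowMap}} code: deriving an array slot from a 64-bit key without forcing the result non-negative yields an out-of-range index for some longs, because Java's `%` keeps the sign of the dividend.

```python
def java_mod(h, capacity):
    # Python's % is always non-negative; emulate Java's %, whose result
    # keeps the sign of the dividend.
    r = abs(h) % capacity
    return r if h >= 0 else -r

def naive_slot(key, capacity):
    # Buggy pattern: using the signed remainder directly as an array index.
    # A negative key gives a negative (out-of-range) slot -> AIOOBE in Java.
    return java_mod(key, capacity)

def masked_slot(key, capacity):
    # Safe pattern: mask to a non-negative slot (capacity must be a
    # power of two for the bitwise-AND trick to equal a true modulo).
    return key & (capacity - 1)

capacity = 1 << 10
key = -5543241376386463808          # one of the crashing sample values above
assert not (0 <= naive_slot(key, capacity) < capacity)
assert 0 <= masked_slot(key, capacity) < capacity
```

Whether Spark's actual index computation fails this way is exactly what the reporter asks reviewers of {{HashedRelation.scala}} to check; the sketch only illustrates why huge negative long keys are the natural suspects.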
[jira] [Updated] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16802: Priority: Blocker (was: Major)
[jira] [Updated] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16802: Priority: Critical (was: Blocker)
[jira] [Updated] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16802: Target Version/s: 2.0.1, 2.1.0
[jira] [Updated] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16804:
    Target Version/s: 2.0.1, 2.1.0 (was: 2.0.1)

> Correlated subqueries containing LIMIT return incorrect results
> ---------------------------------------------------------------
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Nattavut Sutyanyong
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT can return incorrect results. The rule ResolveSubquery in the Analysis phase moves correlated predicates into join predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from t1.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
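The intended semantics can be modeled in plain Python (a reference sketch, not Spark code): EXISTS is true for an outer row as soon as at least one subquery row matches it, so a LIMIT inside the EXISTS subquery cannot change the answer, and both rows of t1 must survive.

```python
# Reference model of the query above: for each outer row of t1, test
# whether any row of t2 satisfies the correlated predicate. The LIMIT is
# applied per outer row, after the correlated filter, so it is a no-op
# for EXISTS -- which is exactly why the shown single-row result is wrong.
t1 = [1, 2]
t2 = [1, 2]

def exists_with_limit(c1, rows, limit=None):
    matches = [c2 for c2 in rows if c1 == c2]   # correlated filter t1.c1 = t2.c2
    if limit is not None:
        matches = matches[:limit]               # inner LIMIT: cannot empty a
                                                # non-empty match set
    return len(matches) > 0

result = [c1 for c1 in t1 if exists_with_limit(c1, t2, limit=1)]
assert result == [1, 2]   # the correct result: both rows of t1
```

The bug report's output shows Spark 2.0.0 returning only the first row, because the analyzer hoists the correlated predicate into a join without accounting for the LIMIT's placement.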
[jira] [Commented] (SPARK-16810) Refactor registerSinks with multiple constructors
[ https://issues.apache.org/jira/browse/SPARK-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400263#comment-15400263 ]

Apache Spark commented on SPARK-16810:

User 'lovexi' has created a pull request for this issue: https://github.com/apache/spark/pull/14415

> Refactor registerSinks with multiple constructors
> -------------------------------------------------
>
> Key: SPARK-16810
> URL: https://issues.apache.org/jira/browse/SPARK-16810
> Project: Spark
> Issue Type: Improvement
> Reporter: YangyangLiu
>
> Some metrics sinks may need application-specific information from SparkConf, so SparkConf has to be passed to those sinks via the sink constructor. The registerSinks class-reflection logic was refactored to allow multiple types of sink constructor to be initialized.
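The constructor-selection idea can be sketched in plain Python. This is a hedged illustration with made-up names (`LegacySink`, `ConfAwareSink`, `register_sink`); Spark's real implementation does this with Scala/Java reflection over constructor parameter lists in its MetricsSystem.

```python
# Sketch of trying several constructor signatures in order of richness and
# falling back -- the shape of the refactor the issue describes. All class
# and function names here are illustrative, not Spark's actual API.
class SparkConf(dict):
    """Stand-in for Spark's configuration object."""

class LegacySink:
    # Old-style sink: constructor takes only the sink properties.
    def __init__(self, properties):
        self.properties = properties
        self.conf = None

class ConfAwareSink:
    # New-style sink: constructor also receives the SparkConf.
    def __init__(self, properties, conf):
        self.properties = properties
        self.conf = conf

def register_sink(sink_class, properties, conf):
    # Try the SparkConf-aware signature first, then the legacy one,
    # mirroring reflection over multiple declared constructors.
    for args in ((properties, conf), (properties,)):
        try:
            return sink_class(*args)
        except TypeError:
            continue
    raise TypeError("no supported constructor for %s" % sink_class.__name__)

legacy = register_sink(LegacySink, {"period": "10"}, SparkConf())
aware = register_sink(ConfAwareSink, {"period": "10"}, SparkConf(app="x"))
assert legacy.conf is None
assert aware.conf == {"app": "x"}
```

The design point is that existing sinks keep working unchanged while new sinks can opt into the richer signature; only the registration code needs to know about both.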
[jira] [Assigned] (SPARK-16810) Refactor registerSinks with multiple constructors
[ https://issues.apache.org/jira/browse/SPARK-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16810: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-16810) Refactor registerSinks with multiple constructors
[ https://issues.apache.org/jira/browse/SPARK-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16810: Assignee: Apache Spark
[jira] [Commented] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400260#comment-15400260 ]

Sylvain Zimmer commented on SPARK-16802:

Maybe useful for others: an ugly workaround to avoid this code path is to cast the join keys to strings and do something like this instead:

{code}
SELECT df1.id1
FROM df1
LEFT OUTER JOIN df2 ON cast(df1.id1 as string) = cast(df2.id2 as string)
{code}
[jira] [Created] (SPARK-16810) Refactor registerSinks with multiple constructors
YangyangLiu created SPARK-16810:
    Summary: Refactor registerSinks with multiple constructors
    Key: SPARK-16810
    URL: https://issues.apache.org/jira/browse/SPARK-16810
    Project: Spark
    Issue Type: Improvement
    Reporter: YangyangLiu
[jira] [Assigned] (SPARK-16809) Link Mesos Dispatcher and History Server
[ https://issues.apache.org/jira/browse/SPARK-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16809:
    Assignee: Apache Spark

> Link Mesos Dispatcher and History Server
> ----------------------------------------
>
> Key: SPARK-16809
> URL: https://issues.apache.org/jira/browse/SPARK-16809
> Project: Spark
> Issue Type: New Feature
> Components: Mesos
> Reporter: Michael Gummelt
> Assignee: Apache Spark
>
> This is largely a duplicate of SPARK-13401, but the PR for that JIRA seems to implement only sandbox linking, not history-server linking, which is the sole scope of this JIRA.
[jira] [Commented] (SPARK-16809) Link Mesos Dispatcher and History Server
[ https://issues.apache.org/jira/browse/SPARK-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400234#comment-15400234 ] Apache Spark commented on SPARK-16809: -- User 'mgummelt' has created a pull request for this issue: https://github.com/apache/spark/pull/14414
[jira] [Assigned] (SPARK-16809) Link Mesos Dispatcher and History Server
[ https://issues.apache.org/jira/browse/SPARK-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16809: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-16809) Link Mesos Dispatcher and History Server
Michael Gummelt created SPARK-16809:
    Summary: Link Mesos Dispatcher and History Server
    Key: SPARK-16809
    URL: https://issues.apache.org/jira/browse/SPARK-16809
    Project: Spark
    Issue Type: New Feature
    Components: Mesos
    Reporter: Michael Gummelt
[jira] [Created] (SPARK-16808) History Server main page does not honor APPLICATION_WEB_PROXY_BASE
Michael Gummelt created SPARK-16808:
    Summary: History Server main page does not honor APPLICATION_WEB_PROXY_BASE
    Key: SPARK-16808
    URL: https://issues.apache.org/jira/browse/SPARK-16808
    Project: Spark
    Issue Type: Bug
    Affects Versions: 2.0.0
    Reporter: Michael Gummelt

The root of the history server is rendered dynamically with JavaScript, and this doesn't honor APPLICATION_WEB_PROXY_BASE:
https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/historypage-template.html#L67

Other links in the history server do honor it:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/UIUtils.scala#L146

This means the links on the history server root page are broken when deployed behind a proxy.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
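The fix the server-rendered pages already apply is simple prefixing. A plain-Python sketch of that behavior (the helper name `prepend_proxy_base` is hypothetical; Spark's real logic lives in UIUtils, which the static JavaScript template bypasses):

```python
# Illustrative sketch, not Spark's actual code: behind a reverse proxy,
# every UI link must be prefixed with the proxy base taken from the
# APPLICATION_WEB_PROXY_BASE environment variable, or it 404s.
import os

def prepend_proxy_base(path, env=None):
    env = os.environ if env is None else env
    base = env.get("APPLICATION_WEB_PROXY_BASE", "")
    # Join base and path with exactly one slash between them.
    return base.rstrip("/") + "/" + path.lstrip("/")

# With a proxy base set, links are rewritten under the proxy prefix...
assert prepend_proxy_base("history/app-1/jobs",
                          {"APPLICATION_WEB_PROXY_BASE": "/proxy"}) \
    == "/proxy/history/app-1/jobs"
# ...and without one, plain absolute paths come back unchanged.
assert prepend_proxy_base("history/app-1/jobs", {}) == "/history/app-1/jobs"
```

The bug is that the JavaScript-rendered root page builds its links without any such prefix, so only that page breaks behind a proxy.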
[jira] [Updated] (SPARK-16807) Optimize some ABS() statements
[ https://issues.apache.org/jira/browse/SPARK-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Zimmer updated SPARK-16807:
    Priority: Minor (was: Major)

> Optimize some ABS() statements
> ------------------------------
>
> Key: SPARK-16807
> URL: https://issues.apache.org/jira/browse/SPARK-16807
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Sylvain Zimmer
> Priority: Minor
>
> I'm not a Catalyst expert, but I think some use cases for the ABS() function could generate simpler code.
> This is the code generated when doing something like {{ABS(x - y) > 0}} or {{ABS(x - y) = 0}} in Spark SQL:
> {code}
> /* 267 */ float filter_value6 = -1.0f;
> /* 268 */ filter_value6 = agg_value27 - agg_value32;
> /* 269 */ float filter_value5 = -1.0f;
> /* 270 */ filter_value5 = (float)(java.lang.Math.abs(filter_value6));
> /* 271 */
> /* 272 */ boolean filter_value4 = false;
> /* 273 */ filter_value4 = org.apache.spark.util.Utils.nanSafeCompareFloats(filter_value5, 0.0f) > 0;
> /* 274 */ if (!filter_value4) continue;
> {code}
> Maybe it could all be simplified to something like this?
> {code}
> filter_value4 = (agg_value27 != agg_value32)
> {code}
> (Of course you could write {{x != y}} directly in the SQL query, but the {{0}} in my example could be a configurable threshold, not something you can hardcode)

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
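The proposed rewrite can be sanity-checked in plain Python (illustrative only, not Spark code): for ordinary floats, {{ABS(x - y) > 0}} and {{x != y}} agree, but NaN-producing inputs are where a real optimizer would have to be careful, which may be one reason the generated code goes through the NaN-safe comparator.

```python
# Check that abs(x - y) > 0 matches x != y for ordinary floats, and find
# where they diverge. via_abs mirrors the generated code's shape: a
# NaN-safe compare of abs(x - y) against 0 (nanSafeCompareFloats orders
# NaN above every value, so a NaN difference compares greater than 0).
import math

def via_abs(x, y):
    d = abs(x - y)
    return d > 0 or math.isnan(d)

def via_neq(x, y):
    return x != y

samples = [(1.0, 1.0), (1.5, -2.0), (0.0, -0.0), (1e300, -1e300)]
assert all(via_abs(x, y) == via_neq(x, y) for x, y in samples)

# Divergence: inf - inf is NaN, so via_abs says True while inf != inf
# is False -- the rewrite needs a guard against NaN-producing inputs.
inf = float("inf")
assert via_abs(inf, inf) != via_neq(inf, inf)
```

So the simplification looks valid for finite inputs (including the {{0.0}} vs {{-0.0}} case), and the threshold variant {{ABS(x - y) > t}} would similarly need NaN handling spelled out before Catalyst could apply it.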
[jira] [Updated] (SPARK-16807) Optimize some ABS() statements
[ https://issues.apache.org/jira/browse/SPARK-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Zimmer updated SPARK-16807:
    Description: appended the closing note that the {{0}} in the example could be a configurable threshold, not something you can hardcode as {{x != y}}; otherwise unchanged.
[jira] [Created] (SPARK-16807) Optimize some ABS() statements
Sylvain Zimmer created SPARK-16807: -- Summary: Optimize some ABS() statements Key: SPARK-16807 URL: https://issues.apache.org/jira/browse/SPARK-16807 Project: Spark Issue Type: Improvement Components: SQL Reporter: Sylvain Zimmer
I'm not a Catalyst expert, but I think some use cases for the ABS() function could generate simpler code. This is the code generated when doing something like {{ABS(x - y) > 0}} or {{ABS(x - y) = 0}} in Spark SQL:
{code}
/* 267 */ float filter_value6 = -1.0f;
/* 268 */ filter_value6 = agg_value27 - agg_value32;
/* 269 */ float filter_value5 = -1.0f;
/* 270 */ filter_value5 = (float)(java.lang.Math.abs(filter_value6));
/* 271 */
/* 272 */ boolean filter_value4 = false;
/* 273 */ filter_value4 = org.apache.spark.util.Utils.nanSafeCompareFloats(filter_value5, 0.0f) > 0;
/* 274 */ if (!filter_value4) continue;
{code}
Maybe it could all be simplified to something like this?
{code}
filter_value4 = (agg_value27 != agg_value32)
{code}
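The proposed rewrite is easy to sanity-check outside Catalyst. The following Python sketch (the helper names are mine, not Spark's) mimics the shape of the generated code above, with a NaN-safe comparison modeled on {{Utils.nanSafeCompareFloats}} (which treats NaN as equal to NaN and greater than any other float). It shows the rewrite holds for ordinary floats, but that NaN inputs need care if the inequality is itself NaN-safe:

```python
import math

def nan_safe_compare(x: float, y: float) -> int:
    # Modeled on Spark's Utils.nanSafeCompareFloats:
    # NaN == NaN, and NaN is greater than any non-NaN float.
    if math.isnan(x) and math.isnan(y):
        return 0
    if math.isnan(x):
        return 1
    if math.isnan(y):
        return -1
    return (x > y) - (x < y)

def abs_filter(a: float, b: float) -> bool:
    # Shape of the code Spark currently generates for ABS(a - b) > 0.
    return nan_safe_compare(abs(a - b), 0.0) > 0

def simplified_filter(a: float, b: float) -> bool:
    # Candidate rewrite, using a NaN-safe inequality: a != b.
    return nan_safe_compare(a, b) != 0

# Equivalent for ordinary floats (including the 0.0 vs -0.0 edge case):
for a, b in [(1.0, 1.0), (1.0, 2.0), (-3.5, 3.5), (0.0, -0.0)]:
    assert abs_filter(a, b) == simplified_filter(a, b)

nan = float("nan")
# ABS(NaN - NaN) > 0 is true under the NaN-safe ordering (NaN > 0.0)...
assert abs_filter(nan, nan) is True
# ...but NaN-safe inequality says NaN == NaN, so the rewrite flips this case.
assert simplified_filter(nan, nan) is False
```

So an optimizer rule along these lines would likely need to special-case NaN-producing inputs to preserve the current semantics.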
[jira] [Commented] (SPARK-1121) Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set
[ https://issues.apache.org/jira/browse/SPARK-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400167#comment-15400167 ] Apache Spark commented on SPARK-1121: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/37 > Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set > - > > Key: SPARK-1121 > URL: https://issues.apache.org/jira/browse/SPARK-1121 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Patrick Cogan >Assignee: prashant > Fix For: 1.0.0 > > > The reason why this is needed is that in the 0.23.X versions of hadoop-client > the avro dependency is fully excluded: > http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/0.23.10/hadoop-client-0.23.10.pom > In later versions 2.2.X the avro dependency is correctly inherited from > hadoop-common: > http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.2.0/hadoop-client-2.2.0.pom > So as a workaround Spark currently depends on Avro directly in the sbt and > scala builds. This is a bit ugly so I'd like to propose the following: > 1. In the Maven build just remove avro and make a note on the > building-with-maven page that they will need to manually add avro for this > build. > 2. On sbt only add the avro dependency if the version is 0.23.X and > SPARK_YARN is true. Also we only need to add avro not both {avro, avro-ipc} > like is there now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400150#comment-15400150 ] Nattavut Sutyanyong commented on SPARK-16804: - To demonstrate that this fix does not unnecessarily block the "good" cases (where LIMIT is present but NOT on the correlated path), here is an example, which produces the same result set both with and without the proposed fix.
scala> sql("select c1 from t1 where exists (select 1 from (select 1 from t2 limit 1) where t1.c1=t2.c2)").show
+---+
| c1|
+---+
|  1|
+---+
> Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates into join > predicates and neglects the semantics of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1.
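The difference between the correct semantics and the rewritten plan can be sketched in plain Python (illustrative only; the helper functions below are not Spark code, just a simulation of the two plans for the tables in the report):

```python
# Tables from the report: t1(c1) and t2(c2) each hold rows 1 and 2.
t1 = [1, 2]
t2 = [1, 2]

def exists_correct(c1):
    # Correct semantics: the correlated filter runs inside the subquery,
    # and LIMIT 1 applies to the already-filtered rows. EXISTS only asks
    # whether at least one row survives, so the LIMIT changes nothing.
    return len([c2 for c2 in t2 if c1 == c2][:1]) > 0

def exists_buggy(c1):
    # The rewritten plan: the correlated predicate is pulled up into a
    # join condition, so LIMIT 1 truncates t2 *before* the filter runs,
    # and only t2's first row is ever compared against c1.
    return any(c1 == c2 for c2 in t2[:1])

assert [c1 for c1 in t1 if exists_correct(c1)] == [1, 2]  # correct answer
assert [c1 for c1 in t1 if exists_buggy(c1)] == [1]       # result reported in the bug
```

This reproduces exactly the single-row result shown in the issue description.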
[jira] [Resolved] (SPARK-16806) from_unixtime function gives wrong answer
[ https://issues.apache.org/jira/browse/SPARK-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16806. --- Resolution: Duplicate > from_unixtime function gives wrong answer > - > > Key: SPARK-16806 > URL: https://issues.apache.org/jira/browse/SPARK-16806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Dong Jiang > > The following is from 2.0, for the same epoch, the function with format > argument generates a different result for the year. > spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd > HH:mm:ss'); > 1969-12-31 19:01:40 1970-12-31 19:01:40
[jira] [Resolved] (SPARK-16794) Spark 2.0.0. with Yarn
[ https://issues.apache.org/jira/browse/SPARK-16794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16794. --- Resolution: Not A Problem > Spark 2.0.0. with Yarn > --- > > Key: SPARK-16794 > URL: https://issues.apache.org/jira/browse/SPARK-16794 > Project: Spark > Issue Type: Question >Affects Versions: 2.0.0 > Environment: AWS Cluster with Hortonworks 2.4 >Reporter: Eliano Marques > > I'm trying to start spark 2.0.0 with yarn. First I had the issues pointed out > here: https://issues.apache.org/jira/browse/SPARK-15343. I then check the > hadoop.yarn.timeline-service.enabled to false and the behaviour changed, i.e. > it started executing the application. However I'm facing the following error. > This might be a silly config but appreciate if you can help. > {code} > spark-shell --master yarn deploy-mode client > {code} > Log: > {code} > 16/07/29 10:04:17 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 16/07/29 10:04:17 WARN component.AbstractLifeCycle: FAILED > ServerConnector@16da476c{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: > Address already in use > java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:433) > at sun.nio.ch.Net.bind(Net.java:425) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at > org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321) > at > org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) > at > org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.spark_project.jetty.server.Server.doStart(Server.java:366) > at > 
org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at > org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:298) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:308) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:308) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2071) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2062) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:308) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:139) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:451) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:451) > at scala.Option.foreach(Option.scala:257) > at org.apache.spark.SparkContext.(SparkContext.scala:451) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > at $line3.$read$$iw$$iw.(:15) > at $line3.$read$$iw.(:31) > at $line3.$read.(:33) > at $line3.$read$.(:37) > at $line3.$read$.() > at $line3.$eval$.$print$lzycompute(:7) > at $line3.$eval$.$print(:6) > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at 
scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) > at > scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) > at > scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) > at > scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) > at >
[jira] [Reopened] (SPARK-16794) Spark 2.0.0. with Yarn
[ https://issues.apache.org/jira/browse/SPARK-16794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-16794: --- This should not be resolved as Fixed because there was no Spark change associated to it > Spark 2.0.0. with Yarn > --- > > Key: SPARK-16794 > URL: https://issues.apache.org/jira/browse/SPARK-16794 > Project: Spark > Issue Type: Question >Affects Versions: 2.0.0 > Environment: AWS Cluster with Hortonworks 2.4 >Reporter: Eliano Marques > > I'm trying to start spark 2.0.0 with yarn. First I had the issues pointed out > here: https://issues.apache.org/jira/browse/SPARK-15343. I then check the > hadoop.yarn.timeline-service.enabled to false and the behaviour changed, i.e. > it started executing the application. However I'm facing the following error. > This might be a silly config but appreciate if you can help. > {code} > spark-shell --master yarn deploy-mode client > {code} > Log: > {code} > 16/07/29 10:04:17 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... 
using builtin-java classes where applicable > 16/07/29 10:04:17 WARN component.AbstractLifeCycle: FAILED > ServerConnector@16da476c{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: > Address already in use > java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:433) > at sun.nio.ch.Net.bind(Net.java:425) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at > org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321) > at > org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) > at > org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.spark_project.jetty.server.Server.doStart(Server.java:366) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at > org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:298) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:308) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:308) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2071) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2062) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:308) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:139) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:451) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:451) > at scala.Option.foreach(Option.scala:257) > at org.apache.spark.SparkContext.(SparkContext.scala:451) > at 
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > at $line3.$read$$iw$$iw.(:15) > at $line3.$read$$iw.(:31) > at $line3.$read.(:33) > at $line3.$read$.(:37) > at $line3.$read$.() > at $line3.$eval$.$print$lzycompute(:7) > at $line3.$eval$.$print(:6) > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) > at > scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) > at > scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) > at > scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
[jira] [Closed] (SPARK-16794) Spark 2.0.0. with Yarn
[ https://issues.apache.org/jira/browse/SPARK-16794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eliano Marques closed SPARK-16794. -- Resolution: Fixed > Spark 2.0.0. with Yarn > --- > > Key: SPARK-16794 > URL: https://issues.apache.org/jira/browse/SPARK-16794 > Project: Spark > Issue Type: Question >Affects Versions: 2.0.0 > Environment: AWS Cluster with Hortonworks 2.4 >Reporter: Eliano Marques > > I'm trying to start spark 2.0.0 with yarn. First I had the issues pointed out > here: https://issues.apache.org/jira/browse/SPARK-15343. I then check the > hadoop.yarn.timeline-service.enabled to false and the behaviour changed, i.e. > it started executing the application. However I'm facing the following error. > This might be a silly config but appreciate if you can help. > {code} > spark-shell --master yarn deploy-mode client > {code} > Log: > {code} > 16/07/29 10:04:17 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 16/07/29 10:04:17 WARN component.AbstractLifeCycle: FAILED > ServerConnector@16da476c{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: > Address already in use > java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:433) > at sun.nio.ch.Net.bind(Net.java:425) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at > org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321) > at > org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) > at > org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.spark_project.jetty.server.Server.doStart(Server.java:366) > at > 
org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at > org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:298) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:308) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:308) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2071) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2062) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:308) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:139) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:451) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:451) > at scala.Option.foreach(Option.scala:257) > at org.apache.spark.SparkContext.(SparkContext.scala:451) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > at $line3.$read$$iw$$iw.(:15) > at $line3.$read$$iw.(:31) > at $line3.$read.(:33) > at $line3.$read$.(:37) > at $line3.$read$.() > at $line3.$eval$.$print$lzycompute(:7) > at $line3.$eval$.$print(:6) > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at 
scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) > at > scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638) > at > scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637) > at > scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) > at > scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) > at >
[jira] [Commented] (SPARK-16806) from_unixtime function gives wrong answer
[ https://issues.apache.org/jira/browse/SPARK-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400118#comment-15400118 ] Dongjoon Hyun commented on SPARK-16806: --- You can see SPARK-16384, too. > from_unixtime function gives wrong answer > - > > Key: SPARK-16806 > URL: https://issues.apache.org/jira/browse/SPARK-16806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Dong Jiang > > The following is from 2.0, for the same epoch, the function with format > argument generates a different result for the year. > spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd > HH:mm:ss'); > 1969-12-31 19:01:40 1970-12-31 19:01:40
[jira] [Comment Edited] (SPARK-16806) from_unixtime function gives wrong answer
[ https://issues.apache.org/jira/browse/SPARK-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400113#comment-15400113 ] Dongjoon Hyun edited comment on SPARK-16806 at 7/29/16 10:19 PM: - Yep. Please use `yyyy` instead of `YYYY`.
{code}
scala> sql("select from_unixtime(100), from_unixtime(100, 'yyyy-MM-dd HH:mm:ss')").show
...
| 1969-12-31 16:01:40| 1969-12-31 16:01:40|
{code}
was (Author: dongjoon): Yep. Please use `yyyy` instead of `YYYY`.
{code}
scala> sql("select from_unixtime(100), from_unixtime(100, 'yyyy-MM-dd HH:mm:ss')").show
+---+---+
|from_unixtime(CAST(100 AS BIGINT), yyyy-MM-dd HH:mm:ss)|from_unixtime(CAST(100 AS BIGINT), yyyy-MM-dd HH:mm:ss)|
+---+---+
|1969-12-31 16:01:40| 1969-12-31 16:01:40|
+---+---+
{code}
> from_unixtime function gives wrong answer > - > > Key: SPARK-16806 > URL: https://issues.apache.org/jira/browse/SPARK-16806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Dong Jiang > > The following is from 2.0, for the same epoch, the function with format > argument generates a different result for the year. > spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd > HH:mm:ss'); > 1969-12-31 19:01:40 1970-12-31 19:01:40
[jira] [Comment Edited] (SPARK-16806) from_unixtime function gives wrong answer
[ https://issues.apache.org/jira/browse/SPARK-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400113#comment-15400113 ] Dongjoon Hyun edited comment on SPARK-16806 at 7/29/16 10:18 PM: - Yep. Please use `yyyy` instead of `YYYY`.
{code}
scala> sql("select from_unixtime(100), from_unixtime(100, 'yyyy-MM-dd HH:mm:ss')").show
+---+---+
|from_unixtime(CAST(100 AS BIGINT), yyyy-MM-dd HH:mm:ss)|from_unixtime(CAST(100 AS BIGINT), yyyy-MM-dd HH:mm:ss)|
+---+---+
|1969-12-31 16:01:40| 1969-12-31 16:01:40|
+---+---+
{code}
was (Author: dongjoon): Yep. Please use `yyyy` instead of `YYYY`.
{code}
sql("select from_unixtime(100), from_unixtime(100, 'yyyy-MM-dd HH:mm:ss')").show
{code}
> from_unixtime function gives wrong answer > - > > Key: SPARK-16806 > URL: https://issues.apache.org/jira/browse/SPARK-16806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Dong Jiang > > The following is from 2.0, for the same epoch, the function with format > argument generates a different result for the year. > spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd > HH:mm:ss'); > 1969-12-31 19:01:40 1970-12-31 19:01:40
[jira] [Commented] (SPARK-16806) from_unixtime function gives wrong answer
[ https://issues.apache.org/jira/browse/SPARK-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400113#comment-15400113 ] Dongjoon Hyun commented on SPARK-16806: --- Yep. Please use `yyyy` instead of `YYYY`.
{code}
sql("select from_unixtime(100), from_unixtime(100, 'yyyy-MM-dd HH:mm:ss')").show
{code}
> from_unixtime function gives wrong answer > - > > Key: SPARK-16806 > URL: https://issues.apache.org/jira/browse/SPARK-16806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Dong Jiang > > The following is from 2.0, for the same epoch, the function with format > argument generates a different result for the year. > spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd > HH:mm:ss'); > 1969-12-31 19:01:40 1970-12-31 19:01:40
[jira] [Updated] (SPARK-16806) from_unixtime function gives wrong answer
[ https://issues.apache.org/jira/browse/SPARK-16806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dong Jiang updated SPARK-16806: --- Description: The following is from 2.0, for the same epoch, the function with format argument generates a different result for the year. spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd HH:mm:ss'); 1969-12-31 19:01:40 1970-12-31 19:01:40 was: The following is from 2.0, for the same epoch, the function with format argument generates a different result. spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd HH:mm:ss'); 1969-12-31 19:01:40 1970-12-31 19:01:40 > from_unixtime function gives wrong answer > - > > Key: SPARK-16806 > URL: https://issues.apache.org/jira/browse/SPARK-16806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Dong Jiang > > The following is from 2.0, for the same epoch, the function with format > argument generates a different result for the year. > spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd > HH:mm:ss'); > 1969-12-31 19:01:40 1970-12-31 19:01:40
[jira] [Created] (SPARK-16806) from_unixtime function gives wrong answer
Dong Jiang created SPARK-16806: -- Summary: from_unixtime function gives wrong answer Key: SPARK-16806 URL: https://issues.apache.org/jira/browse/SPARK-16806 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Dong Jiang The following is from 2.0, for the same epoch, the function with format argument generates a different result. spark-sql> select from_unixtime(100), from_unixtime(100, 'YYYY-MM-dd HH:mm:ss'); 1969-12-31 19:01:40 1970-12-31 19:01:40
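The 1970 result is consistent with the classic week-year pitfall: in Java's SimpleDateFormat pattern letters (which Spark 2.0's from_unixtime relies on), uppercase `YYYY` is the week-based year, not the calendar year, and Dec 31 1969 falls in a week that belongs to 1970. The same distinction can be reproduced in Python via the ISO week-based year (a minimal sketch; the wall-clock time below is taken from the report's UTC-5 output for epoch second 100):

```python
from datetime import datetime

# Wall-clock timestamp from the bug report: epoch second 100 rendered in a
# UTC-5 session time zone (1970-01-01 00:01:40 UTC -> 1969-12-31 19:01:40).
d = datetime(1969, 12, 31, 19, 1, 40)

calendar_year = d.year               # what lowercase 'yyyy' formats
week_based_year = d.isocalendar()[0] # ISO analogue of uppercase 'YYYY'

# Dec 31 1969 belongs to the week containing Jan 1 1970, hence the skew.
print(calendar_year, week_based_year)
```

Note that SimpleDateFormat's week year follows locale-dependent week rules rather than strict ISO 8601, but for this date both conventions place the week in 1970.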
[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400105#comment-15400105 ] Nattavut Sutyanyong commented on SPARK-16804: - This fix blocks any correlated subquery when there is a LIMIT operation on the path from the parent table to the correlated predicate. We may consider relaxing this restriction once we have better support for processing correlated subqueries at run-time. SPARK-13417 is an umbrella task to track this effort.
Note that if the LIMIT is not on the correlated path, Spark returns the correct result. Example:
sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2) and exists (select 1 from t2 LIMIT 1)").show
will return both rows from T1, which is correctly handled both with and without this proposed fix.
This fix will change the behaviour of the query
sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").show
to return an error from the Analysis phase, as shown below:
org.apache.spark.sql.AnalysisException: Accessing outer query column is not allowed in a LIMIT: LocalLimit 1 ...
> Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates into join > predicates and neglects the semantics of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1.
[jira] [Commented] (SPARK-16794) Spark 2.0.0. with Yarn
[ https://issues.apache.org/jira/browse/SPARK-16794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400094#comment-15400094 ] Eliano Marques commented on SPARK-16794: I investigated the issue further, and it turns out the problem was this:
{code}
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher
{code}
We made it work by adding the following to spark-defaults.conf:
{code}
spark.driver.extraJavaOptions -Dhdp.version=current
spark.yarn.am.extraJavaOptions -Dhdp.version=current
{code}
Python (Jupyter and shell), Scala (Jupyter and shell) and R (shell and RStudio via sparklyr) are all now working with 2.0.0. Thanks again for your answer. Eliano
> Spark 2.0.0. with Yarn > --- > > Key: SPARK-16794 > URL: https://issues.apache.org/jira/browse/SPARK-16794 > Project: Spark > Issue Type: Question >Affects Versions: 2.0.0 > Environment: AWS Cluster with Hortonworks 2.4 >Reporter: Eliano Marques > > I'm trying to start spark 2.0.0 with yarn. First I had the issues pointed out > here: https://issues.apache.org/jira/browse/SPARK-15343. I then set > hadoop.yarn.timeline-service.enabled to false and the behaviour changed, i.e. > it started executing the application. However I'm facing the following error. > This might be a silly config issue, but I'd appreciate your help. > {code} > spark-shell --master yarn deploy-mode client > {code} > Log: > {code} > 16/07/29 10:04:17 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform...
using builtin-java classes where applicable > 16/07/29 10:04:17 WARN component.AbstractLifeCycle: FAILED > ServerConnector@16da476c{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: > Address already in use > java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:433) > at sun.nio.ch.Net.bind(Net.java:425) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at > org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321) > at > org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80) > at > org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.spark_project.jetty.server.Server.doStart(Server.java:366) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at > org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:298) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:308) > at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:308) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2071) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2062) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:308) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:139) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:451) > at > org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:451) > at scala.Option.foreach(Option.scala:257) > at org.apache.spark.SparkContext.(SparkContext.scala:451) > at 
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2256) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95) > at $line3.$read$$iw$$iw.(:15) > at $line3.$read$$iw.(:31) > at $line3.$read.(:33) > at $line3.$read$.(:37) > at $line3.$read$.() > at $line3.$eval$.$print$lzycompute(:7) > at $line3.$eval$.$print(:6) > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786) > at >
[jira] [Comment Edited] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400089#comment-15400089 ] Charles Allen edited comment on SPARK-16798 at 7/29/16 10:06 PM: - Adding some more flavor, this is running in Mesos coarse mode against 0.28.2. If I take a subset of the data that failed and run it locally (local[4] or local[1]), it succeeds, which is annoying. here are the info logs from the failing tasks: {code} 16/07/29 18:19:20 INFO HadoopRDD: Input split: REDACTED1:0+163064 16/07/29 18:19:20 INFO TorrentBroadcast: Started reading broadcast variable 0 16/07/29 18:19:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 18.2 KB, free 3.6 GB) 16/07/29 18:19:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 10 ms 16/07/29 18:19:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 209.2 KB, free 3.6 GB) 16/07/29 18:19:20 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 16/07/29 18:19:20 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 16/07/29 18:19:20 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 16/07/29 18:19:20 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 16/07/29 18:19:20 INFO deprecation: mapred.job.id is deprecated. 
Instead, use mapreduce.job.id 16/07/29 18:19:21 INFO NativeS3FileSystem: Opening 'REDACTED1' for reading 16/07/29 18:19:21 INFO CodecPool: Got brand-new decompressor [.gz] 16/07/29 18:19:21 ERROR Executor: Exception in task 9.0 in stage 0.0 (TID 9) java.lang.IllegalArgumentException: bound must be positive at java.util.Random.nextInt(Random.java:388) at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/07/29 18:19:21 INFO CoarseGrainedExecutorBackend: Got assigned task 14 16/07/29 18:19:21 INFO Executor: Running task 14.0 in stage 0.0 (TID 14) 16/07/29 18:19:21 INFO HadoopRDD: Input split: REDACTED2:0+157816 16/07/29 18:19:21 INFO NativeS3FileSystem: Opening 'REDACTED2' for reading 16/07/29 18:19:21 INFO CodecPool: Got brand-new decompressor [.gz] 16/07/29 18:19:21 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 14) java.lang.IllegalArgumentException: bound must be positive at java.util.Random.nextInt(Random.java:388) at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/07/29 18:19:21 INFO CoarseGrainedExecutorBackend: Got assigned task 15 16/07/29 18:19:21 INFO Executor: Running task 9.1 in stage 0.0 (TID 15) {code} was (Author: drcrallen): Adding some more flavor, this is running in Mesos coarse mode against
[jira] [Commented] (SPARK-16779) Fix unnecessary use of postfix operations
[ https://issues.apache.org/jira/browse/SPARK-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400092#comment-15400092 ] holdenk commented on SPARK-16779: - I've gone ahead and done a more full-scope version - leaving the XML postfix operators alone for now. > Fix unnecessary use of postfix operations > - > > Key: SPARK-16779 > URL: https://issues.apache.org/jira/browse/SPARK-16779 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: holdenk > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400089#comment-15400089 ] Charles Allen commented on SPARK-16798: --- Adding some more flavor, this is running in Mesos coarse mode against 0.28.2. If I take a subset of the data that failed and run it locally (local[4] or local[1]), it succeeds, which is annoying. here are the info logs from the failing tasks: {code} 16/07/29 18:19:20 INFO HadoopRDD: Input split: REDACTED1.gz:0+163064 16/07/29 18:19:20 INFO TorrentBroadcast: Started reading broadcast variable 0 16/07/29 18:19:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 18.2 KB, free 3.6 GB) 16/07/29 18:19:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 10 ms 16/07/29 18:19:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 209.2 KB, free 3.6 GB) 16/07/29 18:19:20 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 16/07/29 18:19:20 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 16/07/29 18:19:20 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 16/07/29 18:19:20 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 16/07/29 18:19:20 INFO deprecation: mapred.job.id is deprecated. 
Instead, use mapreduce.job.id 16/07/29 18:19:21 INFO NativeS3FileSystem: Opening 'REDACTED1' for reading 16/07/29 18:19:21 INFO CodecPool: Got brand-new decompressor [.gz] 16/07/29 18:19:21 ERROR Executor: Exception in task 9.0 in stage 0.0 (TID 9) java.lang.IllegalArgumentException: bound must be positive at java.util.Random.nextInt(Random.java:388) at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/07/29 18:19:21 INFO CoarseGrainedExecutorBackend: Got assigned task 14 16/07/29 18:19:21 INFO Executor: Running task 14.0 in stage 0.0 (TID 14) 16/07/29 18:19:21 INFO HadoopRDD: Input split: REDACTED2:0+157816 16/07/29 18:19:21 INFO NativeS3FileSystem: Opening 'REDACTED2' for reading 16/07/29 18:19:21 INFO CodecPool: Got brand-new decompressor [.gz] 16/07/29 18:19:21 ERROR Executor: Exception in task 14.0 in stage 0.0 (TID 14) java.lang.IllegalArgumentException: bound must be positive at java.util.Random.nextInt(Random.java:388) at 
org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) at org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/07/29 18:19:21 INFO CoarseGrainedExecutorBackend: Got assigned task 15 16/07/29 18:19:21 INFO Executor: Running task 9.1 in stage 0.0 (TID 15) {code} > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 >
[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400090#comment-15400090 ] Nattavut Sutyanyong commented on SPARK-16804: - scala> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").explain(true) == Parsed Logical Plan == 'Project ['c1] +- 'Filter exists#21 : +- 'SubqueryAlias exists#21 : +- 'GlobalLimit 1 :+- 'LocalLimit 1 : +- 'Project [unresolvedalias(1, None)] : +- 'Filter ('t1.c1 = 't2.c2) : +- 'UnresolvedRelation `t2` +- 'UnresolvedRelation `t1` == Analyzed Logical Plan == c1: int Project [c1#17] +- Filter predicate-subquery#21 [(c1#17 = c2#10)] : +- SubqueryAlias predicate-subquery#21 [(c1#17 = c2#10)] <== This correlated predicate is incorrectly moved above the LIMIT : +- GlobalLimit 1 :+- LocalLimit 1 : +- Project [1 AS 1#26, c2#10] : +- SubqueryAlias t2 : +- Project [value#8 AS c2#10] :+- LocalRelation [value#8] +- SubqueryAlias t1 +- Project [value#15 AS c1#17] +- LocalRelation [value#15] By rewriting the correlated predicate in the subquery in Analysis phase from below the LIMIT 1 operation to above it causing the scan of the subquery table to return only 1 row. The correct semantic is the LIMIT 1 must be applied on the subquery for each input value from the parent table. > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. 
> Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1.
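The correct semantics described in the comment (the correlated predicate filters first, and LIMIT 1 applies inside the subquery once per outer row) can be sketched in plain Python over the same two-row example. The helper names are illustrative, not Spark APIs:

```python
t1 = [1, 2]  # Seq(1, 2).toDF("c1")
t2 = [1, 2]  # Seq(1, 2).toDF("c2")

def exists_with_limit(c1, limit=1):
    # Correct semantics: evaluate the correlated filter (t1.c1 = t2.c2) first,
    # then apply LIMIT per outer row. EXISTS is true if anything survives.
    matches = [c2 for c2 in t2 if c1 == c2][:limit]
    return len(matches) > 0

# Both rows of t1 have a match in t2, so both should be returned.
result = [c1 for c1 in t1 if exists_with_limit(c1)]

# The buggy plan instead truncates t2 to one row before the correlated
# filter runs, which drops the match for c1 = 2:
t2_truncated = t2[:1]
buggy = [c1 for c1 in t1 if any(c1 == c2 for c2 in t2_truncated)]
```

With the filter hoisted above the LIMIT, `buggy` reproduces the single-row output shown in the report, while `result` contains both rows from t1.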
[jira] [Commented] (SPARK-16805) Log timezone when query result does not match
[ https://issues.apache.org/jira/browse/SPARK-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400087#comment-15400087 ] Apache Spark commented on SPARK-16805: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/14413 > Log timezone when query result does not match > - > > Key: SPARK-16805 > URL: https://issues.apache.org/jira/browse/SPARK-16805 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > It is useful to log the timezone when query result does not match, especially > on build machines that have different timezone from AMPLab Jenkins.
[jira] [Assigned] (SPARK-16805) Log timezone when query result does not match
[ https://issues.apache.org/jira/browse/SPARK-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16805: Assignee: Apache Spark (was: Reynold Xin) > Log timezone when query result does not match > - > > Key: SPARK-16805 > URL: https://issues.apache.org/jira/browse/SPARK-16805 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > It is useful to log the timezone when query result does not match, especially > on build machines that have different timezone from AMPLab Jenkins.
[jira] [Assigned] (SPARK-16805) Log timezone when query result does not match
[ https://issues.apache.org/jira/browse/SPARK-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16805: Assignee: Reynold Xin (was: Apache Spark) > Log timezone when query result does not match > - > > Key: SPARK-16805 > URL: https://issues.apache.org/jira/browse/SPARK-16805 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > It is useful to log the timezone when query result does not match, especially > on build machines that have different timezone from AMPLab Jenkins.
[jira] [Updated] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16804: Description: Correlated subqueries with LIMIT could return incorrect results. The rule ResolveSubquery in the Analysis phase moves correlated predicates to a join predicates and neglect the semantic of the LIMIT. Example: {noformat} Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").show +---+ | c1| +---+ | 1| +---+ {noformat} The correct result contains both rows from T1. was: Correlated subqueries with LIMIT could return incorrect results. The rule ResolveSubquery in the Analysis phase moves correlated predicates to a join predicates and neglect the semantic of the LIMIT. Example: Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").show +---+ | c1| +---+ | 1| +---+ The correct result contains both rows from T1. > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. 
[jira] [Created] (SPARK-16805) Log timezone when query result does not match
Reynold Xin created SPARK-16805: --- Summary: Log timezone when query result does not match Key: SPARK-16805 URL: https://issues.apache.org/jira/browse/SPARK-16805 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin It is useful to log the timezone when query result does not match, especially on build machines that have different timezone from AMPLab Jenkins.
[jira] [Commented] (SPARK-16785) dapply doesn't return array or raw columns
[ https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400083#comment-15400083 ] Clark Fitzgerald commented on SPARK-16785: -- Yes I can do a PR for this. It may take me a little while though. > dapply doesn't return array or raw columns > -- > > Key: SPARK-16785 > URL: https://issues.apache.org/jira/browse/SPARK-16785 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 > Environment: Mac OS X >Reporter: Clark Fitzgerald >Priority: Minor > > Calling SparkR::dapplyCollect with R functions that return dataframes > produces an error. This comes up when returning columns of binary data- ie. > serialized fitted models. Also happens when functions return columns > containing vectors. > The error message: > R computation failed with > Error in (function (..., deparse.level = 1, make.row.names = TRUE, > stringsAsFactors = default.stringsAsFactors()) : > invalid list argument: all variables should have the same length > Reproducible example: > https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R > Relates to SPARK-16611
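The R error quoted in the report comes from binding per-partition results whose columns do not all have the same length (vector or raw columns break the rectangular-frame assumption). A rough Python analogue of that consistency check, as an illustration of the failure mode rather than SparkR's actual implementation:

```python
def bind_rows(rows):
    # Stacking rows into a rectangular frame requires every row to have the
    # same number of fields; R's rbind raises the quoted error otherwise.
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError("all variables should have the same length")
    return [list(row) for row in rows]

ok = bind_rows([(1, "a"), (2, "b")])  # uniform rows stack fine

try:
    # A row carrying an extra vector-valued field trips the check.
    bind_rows([(1, [0.1, 0.2]), (2,)])
    raised = False
except ValueError:
    raised = True
```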
[jira] [Assigned] (SPARK-15355) Pro-active block replenishment in case of node/executor failures
[ https://issues.apache.org/jira/browse/SPARK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15355: Assignee: (was: Apache Spark) > Pro-active block replenishment in case of node/executor failures > > > Key: SPARK-15355 > URL: https://issues.apache.org/jira/browse/SPARK-15355 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Shubham Chopra > > Spark currently does not replenish lost replicas. For resiliency and high > availability, BlockManagerMasterEndpoint can proactively verify whether all > cached RDDs have enough replicas, and replenish them, in case they don’t.
[jira] [Assigned] (SPARK-15355) Pro-active block replenishment in case of node/executor failures
[ https://issues.apache.org/jira/browse/SPARK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15355: Assignee: Apache Spark > Pro-active block replenishment in case of node/executor failures > > > Key: SPARK-15355 > URL: https://issues.apache.org/jira/browse/SPARK-15355 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Shubham Chopra >Assignee: Apache Spark > > Spark currently does not replenish lost replicas. For resiliency and high > availability, BlockManagerMasterEndpoint can proactively verify whether all > cached RDDs have enough replicas, and replenish them, in case they don’t.
[jira] [Commented] (SPARK-15355) Pro-active block replenishment in case of node/executor failures
[ https://issues.apache.org/jira/browse/SPARK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400078#comment-15400078 ] Apache Spark commented on SPARK-15355: -- User 'shubhamchopra' has created a pull request for this issue: https://github.com/apache/spark/pull/14412 > Pro-active block replenishment in case of node/executor failures > > > Key: SPARK-15355 > URL: https://issues.apache.org/jira/browse/SPARK-15355 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Shubham Chopra > > Spark currently does not replenish lost replicas. For resiliency and high > availability, BlockManagerMasterEndpoint can proactively verify whether all > cached RDDs have enough replicas, and replenish them, in case they don’t.
[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400075#comment-15400075 ] Nattavut Sutyanyong commented on SPARK-16804: - I am working on a fix. Could you please assign this JIRA to nsyca? > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > The correct result contains both rows from T1.
[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400070#comment-15400070 ] Charles Allen commented on SPARK-16798: --- I am definitely running a *modified* 2.0.0, but modifications are in the scheduler, not the RDD paths. Right now I'm running 1.5.2_2.11 through the deployment system to get as close to apples-to-apples as I can (and so that workflows can be swapped between the two ad-hoc) > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 > > > Key: SPARK-16798 > URL: https://issues.apache.org/jira/browse/SPARK-16798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Code at https://github.com/metamx/druid-spark-batch which was working under > 1.5.2 has ceased to function under 2.0.0 with the below stacktrace. > {code} > java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400071#comment-15400071 ] Charles Allen commented on SPARK-16798: --- I'll run 2.0.0 stock as another test that will go out during this push. > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 > > > Key: SPARK-16798 > URL: https://issues.apache.org/jira/browse/SPARK-16798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Code at https://github.com/metamx/druid-spark-batch which was working under > 1.5.2 has ceased to function under 2.0.0 with the below stacktrace. > {code} > java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
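The failing call in the quoted trace is `java.util.Random.nextInt(bound)`, which throws `IllegalArgumentException: bound must be positive` whenever the bound is not strictly positive. The frames point at `RDD.coalesce`, which draws a random starting partition per task, so the error suggests the partition count used as the bound reached zero. A minimal Python sketch of that failing pattern, assuming a per-task seeded draw; the function name and seeding are illustrative, not Spark's actual code:

```python
import random

def choose_start_partition(task_index, num_partitions):
    # Mirrors the java.util.Random.nextInt(bound) contract from the trace:
    # the bound must be strictly positive, or the draw is rejected outright.
    if num_partitions <= 0:
        raise ValueError("bound must be positive")
    # Seeding by task index makes the choice deterministic per task.
    return random.Random(task_index).randrange(num_partitions)

ok = choose_start_partition(9, 4)  # a valid draw in [0, 4)

try:
    choose_start_partition(9, 0)   # reproduces the reported failure mode
    failed = False
except ValueError:
    failed = True
```

This would be consistent with the report that a local rerun on a data subset succeeds: a different partition count on the cluster could push the bound to zero there but not locally.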
[jira] [Commented] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400068#comment-15400068 ] Apache Spark commented on SPARK-16804: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/14411 > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > The correct result contains both rows from T1.
[jira] [Assigned] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16804: Assignee: (was: Apache Spark) > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > The correct result contains both rows from T1.
[jira] [Assigned] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16804: Assignee: Apache Spark > Correlated subqueries containing LIMIT return incorrect results > --- > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Apache Spark > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > The correct result contains both rows from T1.
[jira] [Commented] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400045#comment-15400045 ] Miao Wang commented on SPARK-16802: --- Very easy to reproduce. I am learning SQL code now. > joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException > > > Key: SPARK-16802 > URL: https://issues.apache.org/jira/browse/SPARK-16802 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Sylvain Zimmer > > Hello! > This is a little similar to > [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I > have reopened it?). > I would recommend giving another full review to {{HashedRelation.scala}}, > particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors > that I haven't managed to reproduce so far, as well as what I suspect could > be memory leaks (I have a query in a loop OOMing after a few iterations > despite not caching its results). > Here is the script to reproduce the ArrayIndexOutOfBoundsException on the > current 2.0 branch: > {code} > import os > import random > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > def randlong(): > return random.randint(-9223372036854775808, 9223372036854775807) > while True: > l1, l2 = randlong(), randlong() > # Sample values that crash: > # l1, l2 = 4661454128115150227, -5543241376386463808 > print "Testing with %s, %s" % (l1, l2) > data1 = [(l1, ), (l2, )] > data2 = [(l1, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug")
> df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > sqlc.registerDataFrameAsTable(df1, "df1") > sqlc.registerDataFrameAsTable(df2, "df2") > result_df = sqlc.sql(""" > SELECT > df1.id1 > FROM df1 > LEFT OUTER JOIN df2 ON df1.id1 = df2.id2 > """) > print result_df.collect() > {code} > {code} > java.lang.ArrayIndexOutOfBoundsException: 1728150825 > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) > at > org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at
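As a rough intuition for how extreme 64-bit keys can break array indexing (a schematic, assumed mechanism for illustration only, not the actual {{LongToUnsafeRowMap}} logic): any scheme that derives an array position from a difference of long keys overflows once the keys span more than the signed 64-bit range, as the sample crashing values from the report do.

```python
# Schematic sketch only: derive an "index span" from two 64-bit keys the
# way a dense long-keyed map might (key - minKey), using Java-style
# wrapping 64-bit signed arithmetic.
def to_int64(x):
    # Emulate Java's wrapping 64-bit signed arithmetic.
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

# Sample values that crash, taken from the reproduction script above:
l1, l2 = 4661454128115150227, -5543241376386463808

span = to_int64(max(l1, l2) - min(l1, l2))
print(span)   # negative: the subtraction wrapped around
# Any array sizing or bounds check derived from span is now wrong,
# which is consistent with an out-of-range index like 1728150825.
assert span < 0
```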
[jira] [Created] (SPARK-16804) Correlated subqueries containing LIMIT return incorrect results
Nattavut Sutyanyong created SPARK-16804: --- Summary: Correlated subqueries containing LIMIT return incorrect results Key: SPARK-16804 URL: https://issues.apache.org/jira/browse/SPARK-16804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Nattavut Sutyanyong Correlated subqueries with LIMIT can return incorrect results. The rule ResolveSubquery in the Analysis phase moves correlated predicates into the join predicates and neglects the semantics of the LIMIT. Example: Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 1)").show +---+ | c1| +---+ | 1| +---+ The correct result contains both rows of t1.
[jira] [Commented] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400029#comment-15400029 ] Apache Spark commented on SPARK-16803: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14410 > SaveAsTable does not work when source DataFrame is built on a Hive Table > > > Key: SPARK-16803 > URL: https://issues.apache.org/jira/browse/SPARK-16803 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as > key, 'abc' as value") > res2: org.apache.spark.sql.DataFrame = [] > scala> val df = sql("select key, value as value from sample.sample") > df: org.apache.spark.sql.DataFrame = [key: int, value: string] > scala> df.write.mode("append").saveAsTable("sample.sample") > scala> sql("select * from sample.sample").show() > +---+-----+ > |key|value| > +---+-----+ > | 1| abc| > | 1| abc| > +---+-----+ > {noformat} > It works in Spark 1.6, but not in Spark 2.0. The error message from > Spark 2.0 is > {noformat} > scala> df.write.mode("append").saveAsTable("sample.sample") > org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation > sample, sample > is not supported.; > {noformat} > So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it > internally uses {{insertInto}}. But, if we change it back, it will break the > semantics of {{saveAsTable}} (this method uses by-name resolution, whereas > {{insertInto}} uses by-position resolution). > Instead, users should use the {{insertInto}} API. We should improve the error > message so that users understand how to work around this until it is supported.
[jira] [Assigned] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16803: Assignee: Apache Spark > SaveAsTable does not work when source DataFrame is built on a Hive Table > > > Key: SPARK-16803 > URL: https://issues.apache.org/jira/browse/SPARK-16803 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {noformat} > scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as > key, 'abc' as value") > res2: org.apache.spark.sql.DataFrame = [] > scala> val df = sql("select key, value as value from sample.sample") > df: org.apache.spark.sql.DataFrame = [key: int, value: string] > scala> df.write.mode("append").saveAsTable("sample.sample") > scala> sql("select * from sample.sample").show() > +---+-----+ > |key|value| > +---+-----+ > | 1| abc| > | 1| abc| > +---+-----+ > {noformat} > It works in Spark 1.6, but not in Spark 2.0. The error message from > Spark 2.0 is > {noformat} > scala> df.write.mode("append").saveAsTable("sample.sample") > org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation > sample, sample > is not supported.; > {noformat} > So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it > internally uses {{insertInto}}. But, if we change it back, it will break the > semantics of {{saveAsTable}} (this method uses by-name resolution, whereas > {{insertInto}} uses by-position resolution). > Instead, users should use the {{insertInto}} API. We should improve the error > message so that users understand how to work around this until it is supported.
[jira] [Assigned] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16803: Assignee: (was: Apache Spark) > SaveAsTable does not work when source DataFrame is built on a Hive Table > > > Key: SPARK-16803 > URL: https://issues.apache.org/jira/browse/SPARK-16803 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as > key, 'abc' as value") > res2: org.apache.spark.sql.DataFrame = [] > scala> val df = sql("select key, value as value from sample.sample") > df: org.apache.spark.sql.DataFrame = [key: int, value: string] > scala> df.write.mode("append").saveAsTable("sample.sample") > scala> sql("select * from sample.sample").show() > +---+-----+ > |key|value| > +---+-----+ > | 1| abc| > | 1| abc| > +---+-----+ > {noformat} > It works in Spark 1.6, but not in Spark 2.0. The error message from > Spark 2.0 is > {noformat} > scala> df.write.mode("append").saveAsTable("sample.sample") > org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation > sample, sample > is not supported.; > {noformat} > So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it > internally uses {{insertInto}}. But, if we change it back, it will break the > semantics of {{saveAsTable}} (this method uses by-name resolution, whereas > {{insertInto}} uses by-position resolution). > Instead, users should use the {{insertInto}} API. We should improve the error > message so that users understand how to work around this until it is supported.
[jira] [Commented] (SPARK-16789) Can't run saveAsTable with database name
[ https://issues.apache.org/jira/browse/SPARK-16789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400025#comment-15400025 ] Xiao Li commented on SPARK-16789: - Please check the description of the JIRA: https://issues.apache.org/jira/browse/SPARK-16803 Thanks! > Can't run saveAsTable with database name > > > Key: SPARK-16789 > URL: https://issues.apache.org/jira/browse/SPARK-16789 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: CentOS 7 JDK 1.8 Hive 1.2.1 >Reporter: SonixLegend > > The function {{saveAsTable}} with a database-qualified table name runs > successfully on 1.6.2, but after upgrading to 2.0 it fails with the error below. My code > and the error message follow. Can you help me? > conf/hive-site.xml > <property> > <name>hive.metastore.uris</name> > <value>thrift://localhost:9083</value> > </property> > val spark = > SparkSession.builder().appName("SparkHive").enableHiveSupport().getOrCreate() > import spark.implicits._ > import spark.sql > val source = sql("select * from sample.sample") > source.createOrReplaceTempView("test") > source.collect.foreach{tuple => println(tuple(0) + ":" + tuple(1))} > val target = sql("select key, 'Spark' as value from test") > println(target.count()) > target.write.mode(SaveMode.Append).saveAsTable("sample.sample") > spark.stop() > Exception in thread "main" org.apache.spark.sql.AnalysisException: Saving > data in MetastoreRelation sample, sample > is not supported.; > at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:218) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:378) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354) > at com.newtouch.sample.SparkHive$.main(SparkHive.scala:25) > at com.newtouch.sample.SparkHive.main(SparkHive.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[jira] [Closed] (SPARK-16789) Can't run saveAsTable with database name
[ https://issues.apache.org/jira/browse/SPARK-16789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li closed SPARK-16789. --- Resolution: Duplicate > Can't run saveAsTable with database name > > > Key: SPARK-16789 > URL: https://issues.apache.org/jira/browse/SPARK-16789 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: CentOS 7 JDK 1.8 Hive 1.2.1 >Reporter: SonixLegend > > The function {{saveAsTable}} with a database-qualified table name runs > successfully on 1.6.2, but after upgrading to 2.0 it fails with the error below. My code > and the error message follow. Can you help me? > conf/hive-site.xml > <property> > <name>hive.metastore.uris</name> > <value>thrift://localhost:9083</value> > </property> > val spark = > SparkSession.builder().appName("SparkHive").enableHiveSupport().getOrCreate() > import spark.implicits._ > import spark.sql > val source = sql("select * from sample.sample") > source.createOrReplaceTempView("test") > source.collect.foreach{tuple => println(tuple(0) + ":" + tuple(1))} > val target = sql("select key, 'Spark' as value from test") > println(target.count()) > target.write.mode(SaveMode.Append).saveAsTable("sample.sample") > spark.stop() > Exception in thread "main" org.apache.spark.sql.AnalysisException: Saving > data in MetastoreRelation sample, sample > is not supported.; > at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:218) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:378) > at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:354) > at com.newtouch.sample.SparkHive$.main(SparkHive.scala:25) > at com.newtouch.sample.SparkHive.main(SparkHive.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[jira] [Updated] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16803: Description: {noformat} scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value") res2: org.apache.spark.sql.DataFrame = [] scala> val df = sql("select key, value as value from sample.sample") df: org.apache.spark.sql.DataFrame = [key: int, value: string] scala> df.write.mode("append").saveAsTable("sample.sample") scala> sql("select * from sample.sample").show() +---+-+ |key|value| +---+-+ | 1| abc| | 1| abc| +---+-+ {noformat} In Spark 1.6, it works, but Spark 2.0 does not work. The error message from Spark 2.0 is {noformat} scala> df.write.mode("append").saveAsTable("sample.sample") org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample is not supported.; {noformat} So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it internally uses {{insertInto}}. But, if we change it back it will break the semantic of {{saveAsTable}} (this method uses by-name resolution instead of using by-position resolution used by {{insertInto}}). Instead, users should use {{insertInto}} API. We should correct the error messages. Users can understand how to bypass it before we support it. was: {noformat} scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value") res2: org.apache.spark.sql.DataFrame = [] scala> val df = sql("select key, value as value from sample.sample") df: org.apache.spark.sql.DataFrame = [key: int, value: string] scala> df.write.mode("append").saveAsTable("sample.sample") scala> sql("select * from sample.sample").show() +---+-+ |key|value| +---+-+ | 1| abc| | 1| abc| +---+-+ {noformat} In Spark 1.6, it works, but Spark 2.0 does not work. 
The error message from Spark 2.0 is {{noformat}} scala> df.write.mode("append").saveAsTable("sample.sample") org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample is not supported.; {{noformat}} So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it internally uses {{insertInto}}. But, if we change it back it will break the semantic of {{saveAsTable}} (this method uses by-name resolution instead of using by-position resolution used by {{insertInto}}). Instead, users should use {{insertInto}} API. We should correct the error messages. Users can understand how to bypass it before we support it. > SaveAsTable does not work when source DataFrame is built on a Hive Table > > > Key: SPARK-16803 > URL: https://issues.apache.org/jira/browse/SPARK-16803 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as > key, 'abc' as value") > res2: org.apache.spark.sql.DataFrame = [] > scala> val df = sql("select key, value as value from sample.sample") > df: org.apache.spark.sql.DataFrame = [key: int, value: string] > scala> df.write.mode("append").saveAsTable("sample.sample") > scala> sql("select * from sample.sample").show() > +---+-+ > |key|value| > +---+-+ > | 1| abc| > | 1| abc| > +---+-+ > {noformat} > In Spark 1.6, it works, but Spark 2.0 does not work. The error message from > Spark 2.0 is > {noformat} > scala> df.write.mode("append").saveAsTable("sample.sample") > org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation > sample, sample > is not supported.; > {noformat} > So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it > internally uses {{insertInto}}. But, if we change it back it will break the > semantic of {{saveAsTable}} (this method uses by-name resolution instead of > using by-position resolution used by {{insertInto}}). 
> Instead, users should use the {{insertInto}} API. We should improve the error > message so that users understand how to work around this until it is supported.
[jira] [Updated] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16803: Description: {noformat} scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value") res2: org.apache.spark.sql.DataFrame = [] scala> val df = sql("select key, value as value from sample.sample") df: org.apache.spark.sql.DataFrame = [key: int, value: string] scala> df.write.mode("append").saveAsTable("sample.sample") scala> sql("select * from sample.sample").show() +---+-+ |key|value| +---+-+ | 1| abc| | 1| abc| +---+-+ {noformat} In Spark 1.6, it works, but Spark 2.0 does not work. The error message from Spark 2.0 is {{noformat}} scala> df.write.mode("append").saveAsTable("sample.sample") org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample is not supported.; {{noformat}} So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it internally uses {{insertInto}}. But, if we change it back it will break the semantic of {{saveAsTable}} (this method uses by-name resolution instead of using by-position resolution used by {{insertInto}}). Instead, users should use {{insertInto}} API. We should correct the error messages. Users can understand how to bypass it before we support it. was: {{noformat}} scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value") res2: org.apache.spark.sql.DataFrame = [] scala> val df = sql("select key, value as value from sample.sample") df: org.apache.spark.sql.DataFrame = [key: int, value: string] scala> df.write.mode("append").saveAsTable("sample.sample") scala> sql("select * from sample.sample").show() +---+-+ |key|value| +---+-+ | 1| abc| | 1| abc| +---+-+ {{noformat}} In Spark 1.6, it works, but Spark 2.0 does not work. 
The error message from Spark 2.0 is {{noformat}} scala> df.write.mode("append").saveAsTable("sample.sample") org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample is not supported.; {{noformat}} So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it internally uses {{insertInto}}. But, if we change it back it will break the semantic of {{saveAsTable}} (this method uses by-name resolution instead of using by-position resolution used by {{insertInto}}). Instead, users should use {{insertInto}} API. We should correct the error messages. Users can understand how to bypass it before we support it. > SaveAsTable does not work when source DataFrame is built on a Hive Table > > > Key: SPARK-16803 > URL: https://issues.apache.org/jira/browse/SPARK-16803 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as > key, 'abc' as value") > res2: org.apache.spark.sql.DataFrame = [] > scala> val df = sql("select key, value as value from sample.sample") > df: org.apache.spark.sql.DataFrame = [key: int, value: string] > scala> df.write.mode("append").saveAsTable("sample.sample") > scala> sql("select * from sample.sample").show() > +---+-+ > |key|value| > +---+-+ > | 1| abc| > | 1| abc| > +---+-+ > {noformat} > In Spark 1.6, it works, but Spark 2.0 does not work. The error message from > Spark 2.0 is > {{noformat}} > scala> df.write.mode("append").saveAsTable("sample.sample") > org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation > sample, sample > is not supported.; > {{noformat}} > So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it > internally uses {{insertInto}}. But, if we change it back it will break the > semantic of {{saveAsTable}} (this method uses by-name resolution instead of > using by-position resolution used by {{insertInto}}). 
> Instead, users should use the {{insertInto}} API. We should improve the error > message so that users understand how to work around this until it is supported.
[jira] [Created] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table
Xiao Li created SPARK-16803: --- Summary: SaveAsTable does not work when source DataFrame is built on a Hive Table Key: SPARK-16803 URL: https://issues.apache.org/jira/browse/SPARK-16803 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li {{noformat}} scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value") res2: org.apache.spark.sql.DataFrame = [] scala> val df = sql("select key, value as value from sample.sample") df: org.apache.spark.sql.DataFrame = [key: int, value: string] scala> df.write.mode("append").saveAsTable("sample.sample") scala> sql("select * from sample.sample").show() +---+-----+ |key|value| +---+-----+ | 1| abc| | 1| abc| +---+-----+ {{noformat}} It works in Spark 1.6, but not in Spark 2.0. The error message from Spark 2.0 is {{noformat}} scala> df.write.mode("append").saveAsTable("sample.sample") org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample is not supported.; {{noformat}} So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it internally uses {{insertInto}}. But, if we change it back, it will break the semantics of {{saveAsTable}} (this method uses by-name resolution, whereas {{insertInto}} uses by-position resolution). Instead, users should use the {{insertInto}} API. We should improve the error message so that users understand how to work around this until it is supported.
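The by-name versus by-position distinction driving this decision can be sketched in plain Python (illustrative only, with a hypothetical one-row table; this is not Spark's implementation):

```python
# Target table columns, and a source "DataFrame" whose columns happen to
# be declared in a different order.
table_columns = ["key", "value"]
df_columns = ["value", "key"]
row = {"value": "abc", "key": 1}

# saveAsTable-style resolution: match source columns to table columns by NAME,
# so the source column order does not matter.
by_name = [row[name] for name in table_columns]
print(by_name)  # → [1, 'abc']

# insertInto-style resolution: the i-th source column feeds the i-th table
# column by POSITION, regardless of names -- here "key" would receive 'abc'.
by_position = [row[df_columns[i]] for i in range(len(table_columns))]
print(by_position)  # → ['abc', 1]
```

Silently switching {{saveAsTable}} back to positional behavior would change which of these two answers users get, which is why the fix is a clearer error message rather than reverting to {{insertInto}}.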
[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dhruve Ashar updated SPARK-16441: - Affects Version/s: 2.0.0 > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > > The Spark application waits for an RPC response indefinitely, and the Spark > listener is blocked by the dynamic allocation lock. Executors cannot connect to > the driver and are lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at
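The two stack traces above describe a classic blocking-call-under-lock pattern: the "spark-dynamic-executor-allocation" thread holds the ExecutorAllocationManager monitor while awaiting an RPC reply, and the SparkListenerBus thread that would let things progress is itself BLOCKED on that same monitor. A minimal, hypothetical sketch of the pattern (thread and lock names are illustrative, not Spark's actual code):

```python
import threading
import time

manager_lock = threading.Lock()   # stands in for the ExecutorAllocationManager monitor
rpc_reply = threading.Event()     # stands in for the awaited RPC response
results = {}

def allocation_thread():
    with manager_lock:                                       # schedule() holds the lock...
        results["got_reply"] = rpc_reply.wait(timeout=0.2)   # ...while blocking on the RPC

def listener_thread():
    with manager_lock:            # onTaskEnd needs the same lock, so it stays BLOCKED...
        rpc_reply.set()           # ...and the "reply" is only delivered after the timeout

a = threading.Thread(target=allocation_thread)
b = threading.Thread(target=listener_thread)
a.start()
time.sleep(0.05)                  # let the allocation thread take the lock first
b.start()
a.join()
b.join()
# results["got_reply"] is False: the wait timed out because the thread
# that would have satisfied it was blocked on the lock the waiter held
```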
[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400015#comment-15400015 ] Sean Owen commented on SPARK-16798: --- Yeah I noticed that too, and while the conclusion is probably the same, I wasn't sure why we didn't see a different message there. You are definitely running unmodified 2.0.0? > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 > > > Key: SPARK-16798 > URL: https://issues.apache.org/jira/browse/SPARK-16798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Code at https://github.com/metamx/druid-spark-batch which was working under > 1.5.2 has ceased to function under 2.0.0 with the below stacktrace. > {code} > java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16798) java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2
[ https://issues.apache.org/jira/browse/SPARK-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400014#comment-15400014 ] Charles Allen commented on SPARK-16798: --- The super odd thing here is that RDD.scala:445 *SHOULD* be protected by the check introduced in https://github.com/apache/spark/pull/13282 , but for some reason it does not seem to be. > java.lang.IllegalArgumentException: bound must be positive : Worked in 1.5.2 > > > Key: SPARK-16798 > URL: https://issues.apache.org/jira/browse/SPARK-16798 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > Code at https://github.com/metamx/druid-spark-batch which was working under > 1.5.2 has ceased to function under 2.0.0 with the below stacktrace. > {code} > java.lang.IllegalArgumentException: bound must be positive > at java.util.Random.nextInt(Random.java:388) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:445) > at > org.apache.spark.rdd.RDD$$anonfun$coalesce$1$$anonfun$9.apply(RDD.scala:444) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:807) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {code}
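For context, `java.util.Random.nextInt(bound)` throws `IllegalArgumentException: bound must be positive` whenever `bound <= 0`, and the `coalesce` code at RDD.scala:445 draws a random starting position from a partition count. A hedged Python sketch of that precondition (the guard mirrors the JDK's documented contract; `random_start_partition` is our illustrative name, not Spark's):

```python
import random

def next_int(bound):
    """Mimic java.util.Random.nextInt(int bound): bound must be > 0."""
    if bound <= 0:
        raise ValueError("bound must be positive")
    return random.randrange(bound)

# coalesce picks a random parent partition to start from; if the computed
# bound ever comes out as 0 (e.g. an empty parent RDD slips past the guard
# from PR 13282), the call fails exactly like the stack trace above
def random_start_partition(num_parent_partitions):
    return next_int(num_parent_partitions)
```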
[jira] [Commented] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400010#comment-15400010 ] Sean Owen commented on SPARK-14818: --- [~yhuai] do you want to implement this now? > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Priority: Blocker > > In SparkBuild.scala, we exclude sketch and mllibLocal from the MiMa check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list, so that we can check binary compatibility for them.
[jira] [Resolved] (SPARK-16667) Spark driver executor dont release unused memory
[ https://issues.apache.org/jira/browse/SPARK-16667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16667. --- Resolution: Not A Problem > Spark driver executor dont release unused memory > > > Key: SPARK-16667 > URL: https://issues.apache.org/jira/browse/SPARK-16667 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu wily 64 bits > java 1.8 > 3 slaves (4GB) 1 master (2GB) virtual machines in VMware over an i7 4th generation > with 16 GB RAM >Reporter: Luis Angel Hernández Acosta > > I'm running a Spark app in a standalone cluster. My app creates a SparkContext and > performs many GraphX calculations over time. For each calculation, my app creates a > new Java thread and waits for its completion signal. Between calculations, > memory grows by 50-100 MB. I use a dedicated thread to make sure that every object created > for a calculation is destroyed after the calculation ends, but memory keeps growing. I > tried stopping the SparkContext: all executor memory allocated by the app is freed, > but my driver's memory still grows by the same 50-100 MB. > My graph calculation includes HDFS serialization of RDDs and loading the graph from HDFS. > Spark env: > export SPARK_MASTER_IP=master > export SPARK_WORKER_CORES=4 > export SPARK_WORKER_MEMORY=2919m > export SPARK_WORKER_INSTANCES=1 > export SPARK_DAEMON_MEMORY=256m > export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true > -Dspark.worker.cleanup.interval=10" > Those are my only configurations.
[jira] [Resolved] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16725. --- Resolution: Won't Fix > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. > It would be great to update the Guava dependency to version 16+ > diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala > b/core/src/main/scala/org/apache/spark/SecurityManager.scala > index f72c7de..abddafe 100644 > --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala > +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala > @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom} > import java.security.cert.X509Certificate > import javax.net.ssl._ > > -import com.google.common.hash.HashCodes > +import com.google.common.hash.HashCode > import com.google.common.io.Files > import org.apache.hadoop.io.Text > > @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf) > val secret = new Array[Byte](length) > rnd.nextBytes(secret) > > -val cookie = HashCodes.fromBytes(secret).toString() > +val cookie = HashCode.fromBytes(secret).toString() > SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, > cookie) > cookie >} else { > diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala > b/core/src/main/scala/org/apache/spark/SparkEnv.scala > index af50a6d..02545ae 100644 > --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala > +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala > @@ -72,7 +72,7 @@ class SparkEnv ( > >// A general, soft-reference map for metadata needed during 
HadoopRDD > split computation >// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats). > - private[spark] val hadoopJobMetadata = new > MapMaker().softValues().makeMap[String, Any]() > + private[spark] val hadoopJobMetadata = new > MapMaker().weakValues().makeMap[String, Any]() > >private[spark] var driverTmpDir: Option[String] = None > > diff --git a/pom.xml b/pom.xml > index d064cb5..7c3e036 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -368,8 +368,7 @@ > > com.google.guava > guava > -14.0.1 > -provided > +19.0 > >
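The `HashCodes` → `HashCode` rename in the SecurityManager hunk above is the Guava 14 → 16+ API change driving this request; the call itself just renders random secret bytes as a hex string. A rough Python equivalent of that cookie generation (the function name is ours, not Spark's):

```python
import secrets

def generate_secret_cookie(length=16):
    # rnd.nextBytes(secret) followed by HashCode.fromBytes(secret).toString()
    # amounts to: draw random bytes, render them as lowercase hex
    secret = secrets.token_bytes(length)
    return secret.hex()
```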
[jira] [Resolved] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16702. --- Resolution: Duplicate > Driver hangs after executors are lost > - > > Key: SPARK-16702 > URL: https://issues.apache.org/jira/browse/SPARK-16702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Angus Gerry > Attachments: SparkThreadsBlocked.txt > > > It's my first time, please be kind. > I'm still trying to debug this error locally - at this stage I'm pretty > convinced that it's a weird deadlock/livelock problem due to the use of > {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is > possibly tangentially related to the issues discussed in SPARK-1560 around > the use of blocking calls within locks. > h4. Observed Behavior > When running a Spark job and executors are lost, the job occasionally goes > into a state where it makes no progress with tasks. Most commonly it seems > that the issue occurs when executors are preempted by YARN, but I'm not > confident enough to state that it's restricted to just this scenario. 
> Upon inspecting a thread dump from the driver, the following stack traces > seem noteworthy (a full thread dump is attached): > {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {noformat} > {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)} > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120) > scala.Option.foreach(Option.scala:257) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120) > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142) > org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) > org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) >
[jira] [Resolved] (SPARK-16688) OpenHashSet.MAX_CAPACITY is always based on Int even when using Long
[ https://issues.apache.org/jira/browse/SPARK-16688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16688. --- Resolution: Won't Fix > OpenHashSet.MAX_CAPACITY is always based on Int even when using Long > > > Key: SPARK-16688 > URL: https://issues.apache.org/jira/browse/SPARK-16688 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: Ben McCann > > MAX_CAPACITY is hardcoded to a value of 1073741824: > {code}val MAX_CAPACITY = 1 << 30 > class LongHasher extends Hasher[Long] { > override def hash(o: Long): Int = (o ^ (o >>> 32)).toInt > }{code} > I'd like to stick more than 1B items in my hashmap. Spark's all about big > data, right?
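The `1 << 30` cap follows from JVM array indexing: arrays are addressed with a signed 32-bit `Int`, so the largest power-of-two capacity an open-addressing table can allocate is 2^30 (2^31 already exceeds `Int.MaxValue`). Using a `Long` hasher changes how keys are hashed, not how slots are addressed. A quick check of that arithmetic, plus a Python mimic of the `LongHasher` snippet quoted above:

```python
INT_MAX = 2**31 - 1        # JVM Int.MaxValue
MAX_CAPACITY = 1 << 30

def long_hash(o):
    """Mimic Scala's (o ^ (o >>> 32)).toInt for a 64-bit long."""
    o &= (1 << 64) - 1                  # treat o as an unsigned 64-bit pattern (>>> semantics)
    h = (o ^ (o >> 32)) & 0xFFFFFFFF    # fold the high word into the low word
    return h - (1 << 32) if h >= (1 << 31) else h  # reinterpret low word as a signed Int

# capacity must fit in an Int-indexed array, so doubling past 2**30 is impossible
assert MAX_CAPACITY <= INT_MAX
assert MAX_CAPACITY * 2 > INT_MAX
```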
[jira] [Resolved] (SPARK-16705) Kafka Direct Stream is not experimental anymore
[ https://issues.apache.org/jira/browse/SPARK-16705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16705. --- Resolution: Duplicate > Kafka Direct Stream is not experimental anymore > --- > > Key: SPARK-16705 > URL: https://issues.apache.org/jira/browse/SPARK-16705 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.0 >Reporter: Bertrand Dechoux >Priority: Trivial > > http://spark.apache.org/docs/latest/streaming-kafka-integration.html > {quote} > Note that this is an experimental feature introduced in Spark 1.3 for the > Scala and Java API, in Spark 1.4 for the Python API. > {quote} > The feature was indeed marked as experimental for Spark 1.3 but is not > anymore. > * > https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html > * > https://spark.apache.org/docs/1.6.2/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html
[jira] [Updated] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16802: --- Description: Hello! This is a little similar to [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I have reopened it?). I would recommend to give another full review to {{HashedRelation.scala}}, particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors that I haven't managed to reproduce so far, as well as what I suspect could be memory leaks (I have a query in a loop OOMing after a few iterations despite not caching its results). Here is the script to reproduce the ArrayIndexOutOfBoundsException on the current 2.0 branch: {code} import os import random from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) schema1 = SparkTypes.StructType([ SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) ]) schema2 = SparkTypes.StructType([ SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) ]) def randlong(): return random.randint(-9223372036854775808, 9223372036854775807) while True: l1, l2 = randlong(), randlong() # Sample values that crash: # l1, l2 = 4661454128115150227, -5543241376386463808 print "Testing with %s, %s" % (l1, l2) data1 = [(l1, ), (l2, )] data2 = [(l1, )] df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) crash = True if crash: os.system("rm -rf /tmp/sparkbug") df1.write.parquet("/tmp/sparkbug/vertex") df2.write.parquet("/tmp/sparkbug/edge") df1 = sqlc.read.load("/tmp/sparkbug/vertex") df2 = sqlc.read.load("/tmp/sparkbug/edge") sqlc.registerDataFrameAsTable(df1, "df1") sqlc.registerDataFrameAsTable(df2, "df2") result_df = sqlc.sql(""" SELECT df1.id1 FROM df1 LEFT OUTER JOIN df2 ON df1.id1 = df2.id2 """) print result_df.collect() {code} {code} 
java.lang.ArrayIndexOutOfBoundsException: 1728150825 at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) at org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at 
org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/07/29 20:19:00 WARN TaskSetManager: Lost task 0.0 in stage 17.0 (TID 50, localhost): java.lang.ArrayIndexOutOfBoundsException: 1728150825 at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463)
[jira] [Updated] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16802: --- Description: Hello! This is a little similar to [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I have reopened it?). I would recommend to give another full review to {{HashedRelation.scala}}, particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors that I haven't managed to reproduce so far, as well as what I suspect could be memory leaks (I have a query in a loop OOMing after a few iterations despite not caching its results). Here is the script to reproduce the ArrayIndexOutOfBoundsException on the current 2.0 branch: {code} import os import random from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) schema1 = SparkTypes.StructType([ SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True), # SparkTypes.StructField("weight1", SparkTypes.FloatType(), nullable=True) ]) schema2 = SparkTypes.StructType([ SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True), # SparkTypes.StructField("weight2", SparkTypes.LongType(), nullable=True) ]) def randlong(): return random.randint(-9223372036854775808, 9223372036854775807) while True: l1, l2 = randlong(), randlong() # Sample values that crash: # l1, l2 = 4661454128115150227, -5543241376386463808 print "Testing with %s, %s" % (l1, l2) data1 = [(l1, ), (l2, )] data2 = [(l1, )] df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) crash = True if crash: os.system("rm -rf /tmp/sparkbug") df1.write.parquet("/tmp/sparkbug/vertex") df2.write.parquet("/tmp/sparkbug/edge") df1 = sqlc.read.load("/tmp/sparkbug/vertex") df2 = sqlc.read.load("/tmp/sparkbug/edge") sqlc.registerDataFrameAsTable(df1, "df1") sqlc.registerDataFrameAsTable(df2, "df2") 
result_df = sqlc.sql(""" SELECT df1.id1 FROM df1 LEFT OUTER JOIN df2 ON df1.id1 = df2.id2 """) print result_df.collect() {code} {code} java.lang.ArrayIndexOutOfBoundsException: 1728150825 at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) at org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898) at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/07/29 20:19:00 WARN TaskSetManager: Lost task 0.0 in stage 17.0 (TID 50, localhost):
[jira] [Commented] (SPARK-16801) clearThreshold does not work for SparseVector
[ https://issues.apache.org/jira/browse/SPARK-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399953#comment-15399953 ] Sean Owen commented on SPARK-16801: --- {{clearThreshold}} is unrelated to vectors entirely. It's a threshold on the dot product. You'd have to provide a specific example, but I don't think this can be the actual problem if there is one. > clearThreshold does not work for SparseVector > - > > Key: SPARK-16801 > URL: https://issues.apache.org/jira/browse/SPARK-16801 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.2 >Reporter: Rahul Shah >Priority: Minor > > The LogisticRegression model in the mllib library behaves randomly when > passed a SparseVector instead of a DenseVector.
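To illustrate Sean's point that the threshold acts on the dot product, here is a hedged toy sketch of mllib-style binary logistic prediction (names, defaults, and structure are ours, not mllib's exact implementation). Whether the features arrived as a sparse or dense vector never enters this logic; only the dot product does:

```python
import math

def predict(weights, features, intercept=0.0, threshold=0.5):
    """Toy binary logistic prediction.

    threshold=None plays the role of clearThreshold(): return the raw
    probability instead of a 0/1 label.
    """
    # the only thing the input vector contributes is this dot product
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    score = 1.0 / (1.0 + math.exp(-margin))
    if threshold is None:            # cleared threshold => raw score
        return score
    return 1.0 if score > threshold else 0.0
```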
[jira] [Created] (SPARK-16802) joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException
Sylvain Zimmer created SPARK-16802: -- Summary: joins.LongToUnsafeRowMap crashes with ArrayIndexOutOfBoundsException Key: SPARK-16802 URL: https://issues.apache.org/jira/browse/SPARK-16802 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 2.0.1 Reporter: Sylvain Zimmer Hello! This is a little similar to [SPARK-16740|https://issues.apache.org/jira/browse/SPARK-16740] (should I have reopened it?). I would recommend to give another full review to {{HashedRelation.scala}}, particularly the new {{LongToUnsafeRowMap}} code. I've had a few other errors that I haven't managed to reproduce so far, as well as what I suspect could be memory leaks (I have a query in a loop OOMing after a few iterations despite not caching its results). Here is the script to reproduce the ArrayIndexOutOfBoundsException on the current 2.0 branch: {{code}} import os import random from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) schema1 = SparkTypes.StructType([ SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True), # SparkTypes.StructField("weight1", SparkTypes.FloatType(), nullable=True) ]) schema2 = SparkTypes.StructType([ SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True), # SparkTypes.StructField("weight2", SparkTypes.LongType(), nullable=True) ]) def randlong(): return random.randint(-9223372036854775808, 9223372036854775807) while True: l1, l2 = randlong(), randlong() # Sample values that crash: # l1, l2 = 4661454128115150227, -5543241376386463808 print "Testing with %s, %s" % (l1, l2) data1 = [(l1, ), (l2, )] data2 = [(l1, )] df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) crash = True if crash: os.system("rm -rf /tmp/sparkbug") df1.write.parquet("/tmp/sparkbug/vertex") df2.write.parquet("/tmp/sparkbug/edge") df1 = sqlc.read.load("/tmp/sparkbug/vertex") df2 = 
sqlc.read.load("/tmp/sparkbug/edge")
sqlc.registerDataFrameAsTable(df1, "df1")
sqlc.registerDataFrameAsTable(df2, "df2")
result_df = sqlc.sql("""
    SELECT df1.id1
    FROM df1
    LEFT OUTER JOIN df2 ON df1.id1 = df2.id2
""")
print result_df.collect()
{code}
{code}
java.lang.ArrayIndexOutOfBoundsException: 1728150825
	at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463)
	at org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:899)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1898)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at
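Editorial note: independent of the indexing crash above, the LEFT OUTER JOIN in the reproducer pins down what a correct result must look like. A plain-Python sketch of those semantics (toy id values, hypothetical, no Spark involved):

```python
# LEFT OUTER JOIN semantics: every left-side id1 appears in the result,
# matched or not, so a correct plan never returns fewer rows than df1 has.
df1_id1 = [10, 20, 30]   # hypothetical df1.id1 values
df2_id2 = {20}           # hypothetical df2.id2 values

result = [(id1, id1 if id1 in df2_id2 else None) for id1 in df1_id1]
# Unmatched left rows pair with None (SQL NULL) instead of disappearing.
```

This is only a semantic reference point, not a model of Spark's hashed-relation implementation where the exception is thrown.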
[jira] [Updated] (SPARK-16664) Spark 1.6.2 - Persist call on Data frames with more than 200 columns is wiping out the data.
[ https://issues.apache.org/jira/browse/SPARK-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-16664:
------------------------------
    Fix Version/s: 1.6.3

> Spark 1.6.2 - Persist call on Data frames with more than 200 columns is
> wiping out the data.
> -----------------------------------------------------------------------
>
>                 Key: SPARK-16664
>                 URL: https://issues.apache.org/jira/browse/SPARK-16664
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2
>            Reporter: Satish Kolli
>            Assignee: Wesley Tang
>            Priority: Blocker
>             Fix For: 1.6.3, 2.0.1, 2.1.0
>
> Calling persist on a data frame with more than 200 columns removes the
> data from the data frame. This is an issue in Spark 1.6.2; it works without
> any issues in Spark 1.6.1.
> The following test case demonstrates the problem. Please let me know if you
> need any additional information. Thanks.
> {code}
> import org.apache.spark._
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.{Row, SQLContext}
> import org.scalatest.FunSuite
>
> class TestSpark162_1 extends FunSuite {
>   test("test data frame with 200 columns") {
>     val sparkConfig = new SparkConf()
>     val parallelism = 5
>     sparkConfig.set("spark.default.parallelism", s"$parallelism")
>     sparkConfig.set("spark.sql.shuffle.partitions", s"$parallelism")
>     val sc = new SparkContext(s"local[3]", "TestNestedJson", sparkConfig)
>     val sqlContext = new SQLContext(sc)
>     // create dataframe with 200 columns and fake 200 values
>     val size = 200
>     val rdd: RDD[Seq[Long]] = sc.parallelize(Seq(Seq.range(0, size)))
>     val rowRdd: RDD[Row] = rdd.map(d => Row.fromSeq(d))
>     val schemas = List.range(0, size).map(a => StructField("name" + a, LongType, true))
>     val testSchema = StructType(schemas)
>     val testDf = sqlContext.createDataFrame(rowRdd, testSchema)
>     // test value
>     assert(testDf.persist.take(1).apply(0).toSeq(100).asInstanceOf[Long] == 100)
>     sc.stop()
>   }
>
>   test("test data frame with 201 columns") {
>     val sparkConfig = new SparkConf()
>     val parallelism = 5
>     sparkConfig.set("spark.default.parallelism", s"$parallelism")
>     sparkConfig.set("spark.sql.shuffle.partitions", s"$parallelism")
>     val sc = new SparkContext(s"local[3]", "TestNestedJson", sparkConfig)
>     val sqlContext = new SQLContext(sc)
>     // create dataframe with 201 columns and fake 201 values
>     val size = 201
>     val rdd: RDD[Seq[Long]] = sc.parallelize(Seq(Seq.range(0, size)))
>     val rowRdd: RDD[Row] = rdd.map(d => Row.fromSeq(d))
>     val schemas = List.range(0, size).map(a => StructField("name" + a, LongType, true))
>     val testSchema = StructType(schemas)
>     val testDf = sqlContext.createDataFrame(rowRdd, testSchema)
>     // test value
>     assert(testDf.persist.take(1).apply(0).toSeq(100).asInstanceOf[Long] == 100)
>     sc.stop()
>   }
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16797) Repartiton call w/ 0 partitions drops data
[ https://issues.apache.org/jira/browse/SPARK-16797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399948#comment-15399948 ]

Dongjoon Hyun commented on SPARK-16797:
---------------------------------------
Hi, [~bjeffrey]. Could you give me more information on how to reproduce that? For me, it seems to raise a proper exception, like the following.
{code}
scala> spark.sparkContext.parallelize(1 to 10).repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
{code}

> Repartiton call w/ 0 partitions drops data
> ------------------------------------------
>
>                 Key: SPARK-16797
>                 URL: https://issues.apache.org/jira/browse/SPARK-16797
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Bryan Jeffrey
>            Priority: Minor
>              Labels: easyfix
>
> When you call RDD.repartition(0) or DStream.repartition(0), the input data
> is silently dropped. This should not fail silently; instead, an exception
> should be thrown to alert the user to the issue.
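Editorial note: the fix the reporter asks for (fail fast rather than silently dropping data) amounts to a plain argument check before partitioning. A minimal sketch, not Spark's actual implementation; the round-robin split is purely illustrative:

```python
def repartition(data, num_partitions):
    """Sketch of a fail-fast partition-count check (not Spark's real code)."""
    if num_partitions <= 0:
        raise ValueError(
            "requirement failed: Number of partitions (%d) must be positive."
            % num_partitions)
    # Hypothetical round-robin split, for illustration only.
    return [data[i::num_partitions] for i in range(num_partitions)]
```

With the check in place, `repartition(data, 0)` raises immediately instead of returning an empty result.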
[jira] [Assigned] (SPARK-16796) Visible passwords on Spark environment page
[ https://issues.apache.org/jira/browse/SPARK-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16796:
------------------------------------
    Assignee: Apache Spark

> Visible passwords on Spark environment page
> -------------------------------------------
>
>                 Key: SPARK-16796
>                 URL: https://issues.apache.org/jira/browse/SPARK-16796
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>            Reporter: Artur
>            Assignee: Apache Spark
>            Priority: Trivial
>         Attachments: Mask_spark_ssl_keyPassword_spark_ssl_keyStorePassword_spark_ssl_trustStorePassword_from_We1.patch
>
> The Spark properties holding passwords
> (spark.ssl.keyPassword, spark.ssl.keyStorePassword, spark.ssl.trustStorePassword)
> are visible in the Web UI on the Environment page.
> Can we mask them in the Web UI?
[jira] [Commented] (SPARK-16796) Visible passwords on Spark environment page
[ https://issues.apache.org/jira/browse/SPARK-16796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399944#comment-15399944 ]

Apache Spark commented on SPARK-16796:
--------------------------------------
User 'Devian-ua' has created a pull request for this issue:
https://github.com/apache/spark/pull/14409

> Visible passwords on Spark environment page
> -------------------------------------------
>
>                 Key: SPARK-16796
>                 URL: https://issues.apache.org/jira/browse/SPARK-16796
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>            Reporter: Artur
>            Priority: Trivial
>         Attachments: Mask_spark_ssl_keyPassword_spark_ssl_keyStorePassword_spark_ssl_trustStorePassword_from_We1.patch
>
> The Spark properties holding passwords
> (spark.ssl.keyPassword, spark.ssl.keyStorePassword, spark.ssl.trustStorePassword)
> are visible in the Web UI on the Environment page.
> Can we mask them in the Web UI?
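Editorial note: the masking SPARK-16796 asks for can be sketched as redacting any config value whose key looks secret before it is rendered. This is an illustrative sketch, not the attached patch; the marker list and redaction string are assumptions:

```python
SENSITIVE_MARKERS = ("password", "secret", "token")  # assumed key heuristic

def redact(conf):
    """Return a copy of a Spark-style config dict with secret-looking values masked."""
    return {
        key: "*********(redacted)"
        if any(marker in key.lower() for marker in SENSITIVE_MARKERS)
        else value
        for key, value in conf.items()
    }
```

Applying it to `{"spark.ssl.keyPassword": "hunter2", "spark.app.name": "demo"}` masks only the password entry, so the environment page would show the key but never the secret value.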