[jira] [Updated] (SPARK-16621) Generate stable SQLs in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-16621:
-------------------------------
    Assignee: Dongjoon Hyun

> Generate stable SQLs in SQLBuilder
> ----------------------------------
>
>                 Key: SPARK-16621
>                 URL: https://issues.apache.org/jira/browse/SPARK-16621
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>             Fix For: 2.1.0
>
> Currently, the generated SQL contains unstable IDs for generated attributes. Stable generated SQL makes the queries easier to understand and test. This issue provides stable SQL generation as follows:
> * Provide unique IDs for generated subqueries, `gen_subquery_xxx`.
> * Provide unique and stable IDs for generated attributes, `gen_attr_xxx`.
> **Before**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0
> {code}
> **After**
> {code}
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
> res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
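The "After" behaviour above amounts to renumbering generated attribute IDs by order of first appearance instead of by a session-global counter. A minimal, hypothetical sketch of that renumbering idea (plain Python for illustration; `stabilize_ids` is an invented name, not part of Spark's actual SQLBuilder):

```python
import re

def stabilize_ids(sql: str) -> str:
    """Renumber gen_attr_N identifiers by order of first appearance,
    so the same logical query always yields the same generated SQL."""
    mapping = {}

    def renumber(match):
        old = match.group(0)
        if old not in mapping:
            # First time we see this id: assign the next sequential number.
            mapping[old] = f"gen_attr_{len(mapping)}"
        return mapping[old]

    return re.sub(r"gen_attr_\d+", renumber, sql)

# Two runs that differ only in the arbitrary starting id normalize identically:
a = stabilize_ids("SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0")
b = stabilize_ids("SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0")
```

Spark's fix works at the plan level rather than on strings, but the stability property is the same: equal queries produce equal generated SQL.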
[jira] [Resolved] (SPARK-16621) Generate stable SQLs in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian resolved SPARK-16621.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 14257
[https://github.com/apache/spark/pull/14257]

> Generate stable SQLs in SQLBuilder
> ----------------------------------
>
>                 Key: SPARK-16621
>                 URL: https://issues.apache.org/jira/browse/SPARK-16621
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Dongjoon Hyun
>             Fix For: 2.1.0
[jira] [Commented] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395063#comment-15395063 ]

Yin Huai commented on SPARK-16748:
----------------------------------
Seems we should not wrap the NPE in a TreeNodeException.

> Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-16748
>                 URL: https://issues.apache.org/jira/browse/SPARK-16748
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>
> {code}
> import org.apache.spark.sql.functions._
> val myUDF = udf((c: String) => s"""${c.take(5)}""")
> spark.sql("SELECT cast(null as string) as a").select(myUDF($"a").as("b")).orderBy($"b").collect
> {code}
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange rangepartitioning(b#345 ASC, 200)
> +- *Project [UDF(null) AS b#345]
>    +- Scan OneRowRelation[]
>   at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272)
>   at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
>   at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
> {code}
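For reference, the underlying NPE (visible in the creation notice below) comes from calling `take(5)` on a null string inside the UDF, independent of the question of whether Spark should wrap it in a TreeNodeException. A user-side null guard avoids the crash; a minimal sketch of that guard (shown in Python for illustration, mirroring the logic of the Scala UDF; `take5` is a hypothetical name):

```python
def take5(c):
    """Null-safe prefix: the Scala UDF in the report calls c.take(5)
    directly, which throws a NullPointerException when c is null.
    Guarding for the null/None case first avoids the failure."""
    return None if c is None else c[:5]
```

The equivalent guard in the Scala UDF would wrap the input in `Option` (or check for `null`) before calling `take`.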
[jira] [Updated] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-16748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-16748:
-----------------------------
    Description: updated (same reproduction and stack trace as in the creation notice below)
[jira] [Created] (SPARK-16748) Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
Yin Huai created SPARK-16748:
--------------------------------

             Summary: Errors thrown by UDFs cause TreeNodeException when the query has an ORDER BY clause
                 Key: SPARK-16748
                 URL: https://issues.apache.org/jira/browse/SPARK-16748
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Yin Huai

{code}
import org.apache.spark.sql.functions._
val myUDF = udf((c: String) => s"""${c.take(5)}""")
spark.sql("SELECT cast(null as string) as a").select(myUDF($"a").as("b")).orderBy($"b").collect
{code}

{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(b#345 ASC, 200)
+- *Project [UDF(null) AS b#345]
   +- Scan OneRowRelation[]
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
  at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
  at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:113)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:361)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
  at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:272)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
  at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 16.0 failed 4 times, most recent failure: Lost task 0.3 in stage 16.0 (TID 862, 10.0.250.152): java.lang.NullPointerException
  at scala.collection.immutable.StringOps$.slice$extension(StringOps.scala:42)
  at scala.collection.immutable.StringOps.slice(StringOps.scala:31)
  at scala.collection.IndexedSeqOptimized$class.take(IndexedSeqOptimized.scala:132)
  at scala.collection.immutable.StringOps.take(StringOps.scala:31)
  at line2663e8ad718d4a0f8a897d026b83f86035.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:47)
  at line2663e8ad718d4a0f8a897d026b83f86035.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:47)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
  at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
  at org.apache.spark.RangePartitioner$$anonfun$9.apply(Partitioner.scala:261)
{code}
[jira] [Updated] (SPARK-16734) Make sure examples in all language bindings are consistent
[ https://issues.apache.org/jira/browse/SPARK-16734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-16734:
-----------------------------
    Target Version/s: 2.0.1, 2.1.0  (was: 2.0.0, 2.0.1, 2.1.0)

> Make sure examples in all language bindings are consistent
> ----------------------------------------------------------
>
>                 Key: SPARK-16734
>                 URL: https://issues.apache.org/jira/browse/SPARK-16734
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Examples, SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Minor
[jira] [Closed] (SPARK-16747) MQTT as a streaming source for SQL Streaming.
[ https://issues.apache.org/jira/browse/SPARK-16747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prashant Sharma closed SPARK-16747.
-----------------------------------
    Resolution: Won't Fix

> MQTT as a streaming source for SQL Streaming.
> ---------------------------------------------
>
>                 Key: SPARK-16747
>                 URL: https://issues.apache.org/jira/browse/SPARK-16747
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL, Streaming
>            Reporter: Prashant Sharma
>            Assignee: Prashant Sharma
[jira] [Comment Edited] (SPARK-16747) MQTT as a streaming source for SQL Streaming.
[ https://issues.apache.org/jira/browse/SPARK-16747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395046#comment-15395046 ]

Prashant Sharma edited comment on SPARK-16747 at 7/27/16 5:05 AM:
------------------------------------------------------------------
This was created by mistake.

was (Author: prashant_):
This created by mistake.
[jira] [Commented] (SPARK-16747) MQTT as a streaming source for SQL Streaming.
[ https://issues.apache.org/jira/browse/SPARK-16747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395046#comment-15395046 ]

Prashant Sharma commented on SPARK-16747:
-----------------------------------------
This created by mistake.
[jira] [Created] (SPARK-16747) MQTT as a streaming source for SQL Streaming.
Prashant Sharma created SPARK-16747:
---------------------------------------

             Summary: MQTT as a streaming source for SQL Streaming.
                 Key: SPARK-16747
                 URL: https://issues.apache.org/jira/browse/SPARK-16747
             Project: Spark
          Issue Type: New Feature
          Components: SQL, Streaming
            Reporter: Prashant Sharma
            Assignee: Prashant Sharma
[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395018#comment-15395018 ]

Apache Spark commented on SPARK-15194:
--------------------------------------
User 'praveendareddy21' has created a pull request for this issue:
https://github.com/apache/spark/pull/14375

> Add Python ML API for MultivariateGaussian
> ------------------------------------------
>
>                 Key: SPARK-15194
>                 URL: https://issues.apache.org/jira/browse/SPARK-15194
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>            Reporter: holdenk
>            Priority: Minor
>
> We have a PySpark API for the MLlib version but not the ML version. This would allow Python's `GaussianMixture` to more closely match the Scala API.
[jira] [Updated] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout
[ https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hongyao Zhao updated SPARK-16746:
---------------------------------
    Description:

I wrote a Spark Streaming program which consumes 1000 messages from one topic of Kafka, does some transformation, and writes the result back to another topic, but I found only 988 messages in the second topic. I checked the logs and confirmed that all messages were received by the receivers, but I also found an HDFS write timeout message printed from class BatchedWriteAheadLog.

I checked out the source code and found code like this:

{code:borderStyle=solid}
/** Add received block. This event will get written to the write ahead log (if enabled). */
def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
    if (writeResult) {
      synchronized {
        getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
      }
      logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId}")
    } else {
      logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
        s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
    }
    writeResult
  } catch {
    case NonFatal(e) =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}
{code}

It seems that ReceiverTracker tries to write the block info to HDFS, but the write operation times out. That makes writeToLog return false, so the line "getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo" is skipped and the block info is lost.

The Spark version I use is 1.6.1, and I did not turn on spark.streaming.receiver.writeAheadLog.enable. I want to know whether or not this is designed behaviour.
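The report above shows the block info being dropped after a single failed write-ahead-log write. One possible mitigation, sketched here as a hypothetical helper in Python (this is not Spark code, and the issue itself asks whether the current behaviour is intended), is to retry the write before giving up, so a transient HDFS timeout does not silently drop the block:

```python
import time

def add_block_with_retry(write_to_log, block_info, queue, attempts=3, backoff_s=0.0):
    """Retry the WAL write a few times before declaring failure.
    The block is only enqueued after a successful write, preserving the
    original invariant that every queued block has been logged."""
    for _ in range(attempts):
        if write_to_log(block_info):
            queue.append(block_info)
            return True
        if backoff_s:
            time.sleep(backoff_s)  # brief pause before retrying a timed-out write
    return False
```

Whether retrying, failing the batch, or surfacing the error to the driver is the right policy depends on the answer to the reporter's question about intended semantics.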
[jira] [Updated] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout
[ https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hongyao Zhao updated SPARK-16746:
---------------------------------
    Description: updated (changed only the {code} block title; the text is unchanged)
[jira] [Updated] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout
[ https://issues.apache.org/jira/browse/SPARK-16746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hongyao Zhao updated SPARK-16746:
---------------------------------
    Description: updated (moved the snippet from a {quote} block into a {code} block; the text is unchanged)
[jira] [Created] (SPARK-16746) Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout
Hongyao Zhao created SPARK-16746: Summary: Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout Key: SPARK-16746 URL: https://issues.apache.org/jira/browse/SPARK-16746 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.6.1 Reporter: Hongyao Zhao Priority: Minor I wrote a Spark Streaming program which consumes 1000 messages from one Kafka topic, does some transformation, and writes the result back to another topic, but I found only 988 messages in the second topic. I checked the logs and confirmed that all messages were received by the receivers, but I also found an HDFS write timeout message printed by the class BatchedWriteAheadLog. I checked out the source code and found code like this: ``` /** Add received block. This event will get written to the write ahead log (if enabled). */ def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = { try { val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo)) if (writeResult) { synchronized { getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo } logDebug(s"Stream ${receivedBlockInfo.streamId} received " + s"block ${receivedBlockInfo.blockStoreResult.blockId}") } else { logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " + s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.") } writeResult } catch { case NonFatal(e) => logError(s"Error adding block $receivedBlockInfo", e) false } } ``` It seems that ReceiverTracker tries to write the block info to HDFS, but the write operation times out; this causes the writeToLog function to return false, so the line "getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo" is skipped and the block info is lost. The Spark version I use is 1.6.1 and I did not turn on spark.streaming.receiver.writeAheadLog.enable. I want to know whether or not this is intended behaviour. 
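The failure mode described in this report — a single failed or timed-out WAL write silegntly dropping the block's bookkeeping — could in principle be mitigated by retrying the write before giving up. A minimal sketch in Scala (a hypothetical helper, not Spark's actual code or fix):

```scala
import scala.util.control.NonFatal

// Hypothetical retry wrapper: re-attempt a write-ahead-log write a few
// times so that one transient HDFS timeout does not drop the block info.
def writeWithRetry(write: () => Boolean, maxAttempts: Int = 3): Boolean = {
  var attempt = 0
  var ok = false
  while (!ok && attempt < maxAttempts) {
    attempt += 1
    ok = try write() catch { case NonFatal(_) => false }
  }
  ok
}
```

Whether a retry is even appropriate here is exactly the design question the reporter raises: with the WAL disabled, dropping the block on a failed bookkeeping write may or may not be intended.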
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395003#comment-15395003 ] Yanbo Liang commented on SPARK-16446: - Will send PR soon. > Gaussian Mixture Model wrapper in SparkR > > > Key: SPARK-16446 > URL: https://issues.apache.org/jira/browse/SPARK-16446 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > Follow instructions in SPARK-16442 and implement Gaussian Mixture Model > wrapper in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)
Joe Chong created SPARK-16745: - Summary: Spark job completed however have to wait for 13 mins (data size is small) Key: SPARK-16745 URL: https://issues.apache.org/jira/browse/SPARK-16745 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.6.1 Environment: Mac OS X Yosemite, Terminal, MacBook Air Late 2014 Reporter: Joe Chong Priority: Minor I submitted a job in the Scala spark-shell to show a DataFrame. The data size is about 43K. The job was successful in the end, but took more than 13 minutes to complete. Upon checking the log, there are multiple exceptions raised on "Failed to check existence of class" with a java.net.ConnectException message indicating a timeout trying to connect to port 52067, the REPL class-server port that Spark set up. Please assist to troubleshoot. Thanks. Started Spark in standalone mode $ spark-shell --driver-memory 5g --master local[*] 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: joechong 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(joechong); users with modify permissions: Set(joechong) 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT 16/07/26 21:05:30 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:52067 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class server' on port 52067. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.6.1 /_/ Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) Type in expressions to have them evaluated. Type :help for more information. 
16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: joechong 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(joechong); users with modify permissions: Set(joechong) 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' on port 52072. 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started 16/07/26 21:05:35 INFO Remoting: Starting remoting 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074] 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 52074. 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity 3.4 GB 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT 16/07/26 21:05:36 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at http://10.199.29.218:4040 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host localhost 16/07/26 21:05:36 INFO executor.Executor: Using REPL class URI: http://10.199.29.218:52067 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52075. 
16/07/26 21:05:36 INFO netty.NettyBlockTransferService: Server created on 52075 16/07/26 21:05:36 INFO storage.BlockManagerMaster: Trying to register BlockManager 16/07/26 21:05:36 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:52075 with 3.4 GB RAM, BlockManagerId(driver, localhost, 52075) 16/07/26 21:05:36 INFO storage.BlockManagerMaster: Registered BlockManager 16/07/26 21:05:36 INFO repl.SparkILoop: Created spark context.. Spark context available as sc. 16/07/26 21:05:37 INFO hive.HiveContext: Initializing execution hive, version 1.2.1 16/07/26 21:05:37 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0 16/07/26 21:05:37 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0 16/07/26 21:05:38 INFO metastore.HiveMetaStore: 0: Opening
[jira] [Updated] (SPARK-16666) Kryo encoder for custom complex classes
[ https://issues.apache.org/jira/browse/SPARK-16666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16666: --- Description: I'm trying to create a dataset with some geo data using Spark and Esri. If `Foo` only has a `Point` field, it works, but if I add some other fields besides the `Point`, I get an ArrayIndexOutOfBoundsException. {code:scala} import com.esri.core.geometry.Point import org.apache.spark.sql.{Encoder, Encoders, SQLContext} import org.apache.spark.{SparkConf, SparkContext} object Main { case class Foo(position: Point, name: String) object MyEncoders { implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point] implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo] } def main(args: Array[String]): Unit = { val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local")) val sqlContext = new SQLContext(sc) import MyEncoders.{FooEncoder, PointEncoder} import sqlContext.implicits._ Seq(new Foo(new Point(0, 0), "bar")).toDS.show } } {code} {noformat} Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71) at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70) at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69) at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69) at 
org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) at org.apache.spark.sql.Dataset.showString(Dataset.scala:263) at org.apache.spark.sql.Dataset.show(Dataset.scala:230) at org.apache.spark.sql.Dataset.show(Dataset.scala:193) at org.apache.spark.sql.Dataset.show(Dataset.scala:201) at Main$.main(Main.scala:24) at Main.main(Main.scala) {noformat} was: I'm trying to create a dataset with some geo data using spark and esri. If `Foo` only have `Point` field, it'll work but if I add some other fields beyond a `Point`, I get ArrayIndexOutOfBoundsException. {code:scala} import com.esri.core.geometry.Point import org.apache.spark.sql.{Encoder, Encoders, SQLContext} import org.apache.spark.{SparkConf, SparkContext} object Main { case class Foo(position: Point, name: String) object MyEncoders { implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point] implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo] } def main(args: Array[String]): Unit = { val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local")) val sqlContext = new SQLContext(sc) import MyEncoders.{FooEncoder, PointEncoder} import sqlContext.implicits._ Seq(new Foo(new Point(0, 0), "bar")).toDS.show } } {code} {noformat} Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71) at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70) at 
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69) at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69) at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) at org.apache.spark.sql.Dataset.showString(Dataset.scala:263) at org.apache.spark.sql.Dataset.show(Dataset.scala:230) at org.apache.spark.sql.Dataset.show(Dataset.scala:193) at org.apache.spark.sql.Dataset.show(Dataset.scala:201) at
[jira] [Resolved] (SPARK-16524) Add RowBatch and RowBasedHashMapGenerator
[ https://issues.apache.org/jira/browse/SPARK-16524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16524. - Resolution: Fixed Assignee: Qifan Pu Fix Version/s: 2.1.0 > Add RowBatch and RowBasedHashMapGenerator > - > > Key: SPARK-16524 > URL: https://issues.apache.org/jira/browse/SPARK-16524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Qifan Pu >Assignee: Qifan Pu > Fix For: 2.1.0 > > > This JIRA adds the implementations for `RowBatch` and > `RowBasedHashMapGenerator`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394875#comment-15394875 ] Marcelo Vanzin edited comment on SPARK-16725 at 7/27/16 1:04 AM: - There is a clear guideline for when you need a library that is different from what Spark provides: shade the library in your applications. Is it optimal? No. Does it cause jar bloat? Yes. Does it suck? Totally. *But it works.* For everybody. Right now. Without having to wait for Spark's dependencies to fix their own dependency mess, or for Spark to fix its own. Some people can use the "userClassPathFirst" options to achieve similar things without shading, but that doesn't work for everyone. was (Author: vanzin): There is a clear guideline for when you need a library that is different from what Spark provides: shade the library in your applications. Is it optimal? No. Does it cause jar bloat? Yes. Does it suck? Totally. *But it works.* For everybody. Right now. Without having to wait for Spark's downstream dependencies to fix their own dependency mess, or for Spark to fix its own. Some people can use the "userClassPathFirst" options to achieve similar things without shading, but that doesn't work for everyone. > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+ > diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala > b/core/src/main/scala/org/apache/spark/SecurityManager.scala > index f72c7de..abddafe 100644 > --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala > +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala > @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom} > import java.security.cert.X509Certificate > import javax.net.ssl._ > > -import com.google.common.hash.HashCodes > +import com.google.common.hash.HashCode > import com.google.common.io.Files > import org.apache.hadoop.io.Text > > @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf) > val secret = new Array[Byte](length) > rnd.nextBytes(secret) > > -val cookie = HashCodes.fromBytes(secret).toString() > +val cookie = HashCode.fromBytes(secret).toString() > SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, > cookie) > cookie >} else { > diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala > b/core/src/main/scala/org/apache/spark/SparkEnv.scala > index af50a6d..02545ae 100644 > --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala > +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala > @@ -72,7 +72,7 @@ class SparkEnv ( > >// A general, soft-reference map for metadata needed during HadoopRDD > split computation >// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats). 
> - private[spark] val hadoopJobMetadata = new > MapMaker().softValues().makeMap[String, Any]() > + private[spark] val hadoopJobMetadata = new > MapMaker().weakValues().makeMap[String, Any]() > >private[spark] var driverTmpDir: Option[String] = None > > diff --git a/pom.xml b/pom.xml > index d064cb5..7c3e036 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -368,8 +368,7 @@ > > com.google.guava > guava > -14.0.1 > -provided > +19.0 > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
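The shading approach Vanzin recommends can be sketched for an sbt build with the sbt-assembly plugin's shade rules (the shaded package name here is illustrative; pick any namespace private to your application):

```scala
// build.sbt fragment (sketch): relocate Guava classes into a private
// namespace so the application's Guava 16+ never collides with the
// Guava 14 that Spark itself ships.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll
)
```

Maven users can achieve the same effect with the maven-shade-plugin's `relocation` configuration.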
[jira] [Commented] (SPARK-16735) Fail to create a map containing decimal type with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394864#comment-15394864 ] Liang Ke commented on SPARK-16735: -- https://github.com/apache/spark/pull/14374 > Fail to create a map containing decimal type with literals having different > inferred precisions and scales > - > > Key: SPARK-16735 > URL: https://issues.apache.org/jira/browse/SPARK-16735 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.0, 2.0.1 >Reporter: Liang Ke > > In Spark 2.0, we parse float literals as decimals. However, this > introduces a side-effect, which is described below. > spark-sql> select map(0.1,0.01, 0.2,0.033); > Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS > DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due > to data type mismatch: The given values of function map should all be the > same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
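Until the fix above lands, the mismatch can typically be worked around by casting every literal to a common decimal type yourself, so map() sees uniform key and value types. A sketch, assuming a Spark 2.0 `spark` session; the precision and scale to use depend on your data:

```scala
// Sketch: cast each literal to decimal(3,3) explicitly so the inferred
// decimal(2,2) vs decimal(3,3) mismatch never arises.
spark.sql(
  "select map(cast(0.1 as decimal(3,3)), cast(0.01 as decimal(3,3)), " +
  "cast(0.2 as decimal(3,3)), cast(0.033 as decimal(3,3)))").show()
```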
[jira] [Commented] (SPARK-16715) Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"
[ https://issues.apache.org/jira/browse/SPARK-16715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394860#comment-15394860 ] Liang Ke commented on SPARK-16715: -- Sorry, I mistook the JIRA id. > Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic > equals and hash" > > > Key: SPARK-16715 > URL: https://issues.apache.org/jira/browse/SPARK-16715 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.1, 2.1.0 > > > SubexpressionEliminationSuite."Semantic equals and hash" assumes the default > AttributeReference's exprId won't be "ExprId(1)". However, that depends on > when this test runs. It may happen to use "ExprId(1)". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
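[Editor's note] The order-dependence described in this issue can be sketched with a global counter — a hypothetical stand-in for Catalyst's `NamedExpression.newExprId`, not the actual Scala code:

```python
import itertools

# Hypothetical stand-in for a JVM-global, monotonically increasing
# expression-id counter shared by every test in the process.
_expr_ids = itertools.count(0)

def new_expr_id() -> int:
    return next(_expr_ids)

# A test that creates one attribute and assumes its id is never 1 is
# order-dependent: the second allocation in a fresh process *is* 1, so
# the assumption only holds if other tests allocated ids first.
first = new_expr_id()
second = new_expr_id()
```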
[jira] [Commented] (SPARK-16744) spark.yarn.appMasterEnv handling assumes values should be appended
[ https://issues.apache.org/jira/browse/SPARK-16744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394812#comment-15394812 ] Kevin Grealish commented on SPARK-16744: Issue discovered when doing narrower fix for SPARK-16110 > spark.yarn.appMasterEnv handling assumes values should be appended > -- > > Key: SPARK-16744 > URL: https://issues.apache.org/jira/browse/SPARK-16744 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0 > Environment: YARN >Reporter: Kevin Grealish > > When processing environment variables set via the conf setting > spark.yarn.appMasterEnv.*, if there is an existing value the new value is > appended assuming it is a CLASSPATH like value. This does not work for most > environment variables, for example PYSPARK_PYTHON. > Related: > https://issues.apache.org/jira/browse/MAPREDUCE-6491 > https://issues.apache.org/jira/browse/SPARK-16110 > https://github.com/apache/spark/pull/13824 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16744) spark.yarn.appMasterEnv handling assumes values should be appended
Kevin Grealish created SPARK-16744: -- Summary: spark.yarn.appMasterEnv handling assumes values should be appended Key: SPARK-16744 URL: https://issues.apache.org/jira/browse/SPARK-16744 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.0, 1.6.2, 1.6.1, 1.6.0 Environment: YARN Reporter: Kevin Grealish When processing environment variables set via the conf setting spark.yarn.appMasterEnv.*, if there is an existing value the new value is appended assuming it is a CLASSPATH like value. This does not work for most environment variables, for example PYSPARK_PYTHON. Related: https://issues.apache.org/jira/browse/MAPREDUCE-6491 https://issues.apache.org/jira/browse/SPARK-16110 https://github.com/apache/spark/pull/13824 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
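[Editor's note] The difference between the two merge policies in this report can be sketched as follows. `merge_env` and its names are illustrative only, not the actual YARN client code:

```python
import os

def merge_env(env: dict, key: str, value: str, append: bool) -> None:
    # Sketch of the two policies. The bug report says the YARN code applied
    # the append (CLASSPATH-style) policy to every spark.yarn.appMasterEnv.*
    # variable, when most variables need the replace policy instead.
    if append and key in env:
        env[key] = env[key] + os.pathsep + value  # CLASSPATH-like variables
    else:
        env[key] = value  # plain variables should be overwritten

env = {"PYSPARK_PYTHON": "/usr/bin/python"}
merge_env(env, "PYSPARK_PYTHON", "/opt/conda/bin/python", append=True)
# PYSPARK_PYTHON now holds two paths joined by os.pathsep -- not a valid
# interpreter path, which is the failure mode described above.
```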
[jira] [Assigned] (SPARK-16735) Fail to create a map contains decimal type with literals having different inferred precessions and scales
[ https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16735: Assignee: Apache Spark > Fail to create a map contains decimal type with literals having different > inferred precessions and scales > - > > Key: SPARK-16735 > URL: https://issues.apache.org/jira/browse/SPARK-16735 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.0, 2.0.1 >Reporter: Liang Ke >Assignee: Apache Spark > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > spark-sql> select map(0.1,0.01, 0.2,0.033); > Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS > DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due > to data type mismatch: The given values of function map should all be the > same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16735) Fail to create a map contains decimal type with literals having different inferred precessions and scales
[ https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16735: Assignee: (was: Apache Spark) > Fail to create a map contains decimal type with literals having different > inferred precessions and scales > - > > Key: SPARK-16735 > URL: https://issues.apache.org/jira/browse/SPARK-16735 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.0, 2.0.1 >Reporter: Liang Ke > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > spark-sql> select map(0.1,0.01, 0.2,0.033); > Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS > DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due > to data type mismatch: The given values of function map should all be the > same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16715) Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"
[ https://issues.apache.org/jira/browse/SPARK-16715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394783#comment-15394783 ] Apache Spark commented on SPARK-16715: -- User 'biglobster' has created a pull request for this issue: https://github.com/apache/spark/pull/14374 > Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic > equals and hash" > > > Key: SPARK-16715 > URL: https://issues.apache.org/jira/browse/SPARK-16715 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.1, 2.1.0 > > > SubexpressionEliminationSuite."Semantic equals and hash" assumes the default > AttributeReference's exprId wont' be "ExprId(1)". However, that depends on > when this test runs. It may happen to use "ExprId(1)". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394768#comment-15394768 ] Min Wei commented on SPARK-16725: - Still worth to use a newer version of Guava. Looks like the issue of upgrading to Guava v16+ is postponed to Hadoop 3.0 at least. https://issues.apache.org/jira/browse/HADOOP-11319 Hopefully Guava devs. will be more disciplined with its API compatibility. There seems to be quite many JIRAs on Guava. > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. > It would be great to update the Guava dependency to version 16+ > diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala > b/core/src/main/scala/org/apache/spark/SecurityManager.scala > index f72c7de..abddafe 100644 > --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala > +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala > @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom} > import java.security.cert.X509Certificate > import javax.net.ssl._ > > -import com.google.common.hash.HashCodes > +import com.google.common.hash.HashCode > import com.google.common.io.Files > import org.apache.hadoop.io.Text > > @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf) > val secret = new Array[Byte](length) > rnd.nextBytes(secret) > > -val cookie = HashCodes.fromBytes(secret).toString() > +val cookie = HashCode.fromBytes(secret).toString() > SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, > cookie) > cookie >} else { > diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala > 
b/core/src/main/scala/org/apache/spark/SparkEnv.scala > index af50a6d..02545ae 100644 > --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala > +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala > @@ -72,7 +72,7 @@ class SparkEnv ( > >// A general, soft-reference map for metadata needed during HadoopRDD > split computation >// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats). > - private[spark] val hadoopJobMetadata = new > MapMaker().softValues().makeMap[String, Any]() > + private[spark] val hadoopJobMetadata = new > MapMaker().weakValues().makeMap[String, Any]() > >private[spark] var driverTmpDir: Option[String] = None > > diff --git a/pom.xml b/pom.xml > index d064cb5..7c3e036 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -368,8 +368,7 @@ > > com.google.guava > guava > -14.0.1 > -provided > +19.0 > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16621) Generate stable SQLs in SQLBuilder
[ https://issues.apache.org/jira/browse/SPARK-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16621: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-16576 > Generate stable SQLs in SQLBuilder > -- > > Key: SPARK-16621 > URL: https://issues.apache.org/jira/browse/SPARK-16621 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Dongjoon Hyun > > Currently, the generated SQLs have not-stable IDs for generated attributes. > The stable generated SQL will give more benefit for understanding or testing > the queries. This issue provides stable SQL generation by the followings. > * Provide unique ids for generated subqueries, `gen_subquery_xxx`. > * Provide unique and stable ids for generated attributes, `gen_attr_xxx`. > **Before** > {code} > scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL > res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS > gen_subquery_0 > scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL > res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS > gen_subquery_0 > {code} > **After** > {code} > scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL > res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS > gen_subquery_0 > scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL > res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS > gen_subquery_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
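[Editor's note] The stabilization idea behind this issue — renumbering generated attribute ids in order of first appearance so the same query always prints the same SQL — can be sketched as a small rewrite pass. This `normalize_ids` helper is illustrative only, not the actual SQLBuilder change:

```python
import re

def normalize_ids(sql: str) -> str:
    """Renumber gen_attr_<exprId> tokens to sequential stable ids, in
    order of first appearance."""
    mapping = {}
    def repl(match):
        token = match.group(0)
        if token not in mapping:
            mapping[token] = "gen_attr_%d" % len(mapping)
        return mapping[token]
    return re.sub(r"gen_attr_\d+", repl, sql)

# The unstable "Before" output from the issue description:
unstable = ("SELECT `gen_attr_4` AS `1` "
            "FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0")
stable = normalize_ids(unstable)
```

Every occurrence of the same raw id maps to the same stable id, so repeated invocations produce identical SQL regardless of the exprId counter's state.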
[jira] [Commented] (SPARK-15232) Add subquery SQL building tests to LogicalPlanToSQLSuite
[ https://issues.apache.org/jira/browse/SPARK-15232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394738#comment-15394738 ] Dongjoon Hyun commented on SPARK-15232: --- Hi, [~hvanhovell]. May I work on this issue? > Add subquery SQL building tests to LogicalPlanToSQLSuite > > > Key: SPARK-15232 > URL: https://issues.apache.org/jira/browse/SPARK-15232 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Priority: Minor > > We currently test subquery SQL building using the {{HiveCompatibilitySuite}}. > This is not desired since SQL building is actually a part of sql/core and > because we are slowly reducing our dependency on Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-16735) Fail to create a map contains decimal type with literals having different inferred precessions and scales
[ https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Ke updated SPARK-16735: - Comment: was deleted (was: hi, if some one can help me to push my patch to github ? thx a lot : )) > Fail to create a map contains decimal type with literals having different > inferred precessions and scales > - > > Key: SPARK-16735 > URL: https://issues.apache.org/jira/browse/SPARK-16735 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.0, 2.0.1 >Reporter: Liang Ke > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > spark-sql> select map(0.1,0.01, 0.2,0.033); > Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS > DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due > to data type mismatch: The given values of function map should all be the > same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16735) Fail to create a map contains decimal type with literals having different inferred precessions and scales
[ https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Ke updated SPARK-16735: - Attachment: (was: SPARK-16735.patch) > Fail to create a map contains decimal type with literals having different > inferred precessions and scales > - > > Key: SPARK-16735 > URL: https://issues.apache.org/jira/browse/SPARK-16735 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.0, 2.0.1 >Reporter: Liang Ke > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > spark-sql> select map(0.1,0.01, 0.2,0.033); > Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS > DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due > to data type mismatch: The given values of function map should all be the > same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394654#comment-15394654 ] Xin Ren edited comment on SPARK-16445 at 7/26/16 10:17 PM: --- I'm still working on it, hopefully by end of this weekend I can submit PR :) I just have a quick question that which parameters should be passed from R command? For fit() of wrapper class, there are many parameters https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-ccb8590441998a896d1b74ca605b56efR62 {code} def fit( formula: String, data: DataFrame, blockSize: Int, layers: Array[Int], initialWeights: Vector, solver: String, seed: Long, maxIter: Int, tol: Double, stepSize: Double ): MultilayerPerceptronClassifierWrapper = { {code} And for R part, should I pass all the parameters from R command? https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-7ede1519b4a56647801b51af33c2dd18R461 I find in the example (http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier), only below parameters are being set, the rest are just usign default values {code} val trainer = new MultilayerPerceptronClassifier() .setLayers(layers) .setBlockSize(128) .setSeed(1234L) .setMaxIter(100) {code} was (Author: iamshrek): I'm still working on it, hopefully by end of this weekend I can submit PR :) I just have a quick question that which parameters should be passed from R command? For fit() of wrapper class, there are many parameters https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-ccb8590441998a896d1b74ca605b56efR62 {code} def fit( formula: String, data: DataFrame, blockSize: Int, layers: Array[Int], initialWeights: Vector, solver: String, seed: Long, maxIter: Int, tol: Double, stepSize: Double ): MultilayerPerceptronClassifierWrapper = { {code} And for R part, should I pass all the parameters from R command? 
https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-7ede1519b4a56647801b51af33c2dd18R461 I find in the example (http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier), only below parameters are being set, the rest are just usign default values {code} val trainer = new MultilayerPerceptronClassifier() .setLayers(layers) .setBlockSize(128) .setSeed(1234L) .setMaxIter(100) {code} > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394654#comment-15394654 ] Xin Ren commented on SPARK-16445: - I'm still working on it, hopefully by end of this weekend I can submit PR :) I just have a quick question that which parameters should be passed from R command? For fit() of wrapper class, there are many parameters https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-ccb8590441998a896d1b74ca605b56efR62 {code} def fit( formula: String, data: DataFrame, blockSize: Int, layers: Array[Int], initialWeights: Vector, solver: String, seed: Long, maxIter: Int, tol: Double, stepSize: Double ): MultilayerPerceptronClassifierWrapper = { {code} And for R part, should I pass all the parameters from R command? https://github.com/apache/spark/compare/master...keypointt:SPARK-16445?expand=1#diff-7ede1519b4a56647801b51af33c2dd18R461 I find in the example (http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier), only below parameters are being set, the rest are just usign default values {code} val trainer = new MultilayerPerceptronClassifier() .setLayers(layers) .setBlockSize(128) .setSeed(1234L) .setMaxIter(100) {code} > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394652#comment-15394652 ] Sean Owen commented on SPARK-16725: --- Again there are two Guavas here: the unshaded one for Hadoop et al, and the shaded one for Spark. The unshaded one is 14. If you update it, other things break. (In 1.6.x, there are actually unshaded Guava Optional in the assembly for historical reasons, but I don't know if that would trigger the warning.) Spark requires Hadoop classes one way or the other. You certainly have them somewhere or nothing will work. You can build Spark to build in Hadoop classes or have them provided at runtime. Maybe you need the latter. If the runtime doesn't provide the dependency, how can any app work without its dependency bundled? Yes, it's likely that the ideal resolution here is for spark-cassandra to shade Guava to be safe, if it's bundling unshaded Guava. I'm not suggesting it's your change to make, but not Spark's either. Shading means moving the dependency to a different namespace, so I'm not sure what you mean. It's probably not realistic to get rid of Guava; it's useful. Even if you did, the problem is that dependencies may not. I don't think it's random; versions are as high as they can be without apparently breaking compatibility with how other commonly-paired dependencies work. That is, there's a reason it's 14 and not 15. > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+ > diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala > b/core/src/main/scala/org/apache/spark/SecurityManager.scala > index f72c7de..abddafe 100644 > --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala > +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala > @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom} > import java.security.cert.X509Certificate > import javax.net.ssl._ > > -import com.google.common.hash.HashCodes > +import com.google.common.hash.HashCode > import com.google.common.io.Files > import org.apache.hadoop.io.Text > > @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf) > val secret = new Array[Byte](length) > rnd.nextBytes(secret) > > -val cookie = HashCodes.fromBytes(secret).toString() > +val cookie = HashCode.fromBytes(secret).toString() > SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, > cookie) > cookie >} else { > diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala > b/core/src/main/scala/org/apache/spark/SparkEnv.scala > index af50a6d..02545ae 100644 > --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala > +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala > @@ -72,7 +72,7 @@ class SparkEnv ( > >// A general, soft-reference map for metadata needed during HadoopRDD > split computation >// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats). 
> - private[spark] val hadoopJobMetadata = new > MapMaker().softValues().makeMap[String, Any]() > + private[spark] val hadoopJobMetadata = new > MapMaker().weakValues().makeMap[String, Any]() > >private[spark] var driverTmpDir: Option[String] = None > > diff --git a/pom.xml b/pom.xml > index d064cb5..7c3e036 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -368,8 +368,7 @@ > > com.google.guava > guava > -14.0.1 > -provided > +19.0 > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394644#comment-15394644 ] Sylvain Zimmer commented on SPARK-16740: OK! Looks like that would be [~davies] :-) > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. > {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran with {{spark-submit}}, the final {{collect()}} call crashes 
with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at >
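[Editor's note] One plausible mechanism for the NegativeArraySizeException above — stated as an assumption, not a confirmed diagnosis: if the dense mode of a long-keyed join map sizes its backing array from `maxKey - minKey`, the keys in this reproduction make that difference overflow a signed 64-bit long, and a negative value used as a Java array length raises exactly this exception. A minimal arithmetic sketch:

```python
def to_int64(x: int) -> int:
    """Wrap an unbounded Python int to a signed 64-bit value, the way a
    JVM long silently overflows."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

# The two long keys from the reproduction script above.
max_key = 4661454128115150227
min_key = -5543241376386463808

# Hypothetical range-based sizing: both operands are valid longs, but
# their difference exceeds the signed 64-bit range and wraps negative.
range_size = to_int64(max_key - min_key + 1)
```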
[jira] [Commented] (SPARK-16445) Multilayer Perceptron Classifier wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394642#comment-15394642 ] Xiangrui Meng commented on SPARK-16445: --- [~iamshrek] Any updates? > Multilayer Perceptron Classifier wrapper in SparkR > -- > > Key: SPARK-16445 > URL: https://issues.apache.org/jira/browse/SPARK-16445 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Xin Ren > > Follow instructions in SPARK-16442 and implement multilayer perceptron > classifier wrapper in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16446) Gaussian Mixture Model wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394639#comment-15394639 ] Xiangrui Meng commented on SPARK-16446: --- [~yanboliang] Any updates? > Gaussian Mixture Model wrapper in SparkR > > > Key: SPARK-16446 > URL: https://issues.apache.org/jira/browse/SPARK-16446 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > Follow instructions in SPARK-16442 and implement Gaussian Mixture Model > wrapper in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16444) Isotonic Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16444: -- Shepherd: Junyang Qian > Isotonic Regression wrapper in SparkR > - > > Key: SPARK-16444 > URL: https://issues.apache.org/jira/browse/SPARK-16444 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Miao Wang > > Implement Isotonic Regression wrapper and other utils in SparkR. > {code} > spark.isotonicRegression(data, formula, ...) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394634#comment-15394634 ] Dongjoon Hyun edited comment on SPARK-16740 at 7/26/16 10:03 PM: - You had better look up the one who made that code with `git blame` command and ask him after passing Jenkins. That is the fastest way to get reviewed. :) was (Author: dongjoon): You had better look up the one you made that with `git blame` command and ask him after passing Jenkins. That is the fastest way to get reviewed. :) > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. 
> {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran with {{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. 
> : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at >
[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394634#comment-15394634 ] Dongjoon Hyun commented on SPARK-16740: --- You had better look up the one who made that code with the `git blame` command and ask them after passing Jenkins. That is the fastest way to get reviewed. :) > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. > {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] 
> print result_df.collect() > {code} > When ran with {{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77) > at >
[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394628#comment-15394628 ] Sylvain Zimmer commented on SPARK-16740: Thanks! I just did. Let me know if that's okay. > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. > {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran with {{spark-submit}}, the final {{collect()}} call 
crashes with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at >
[jira] [Assigned] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16740: Assignee: Apache Spark > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Apache Spark > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. > {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran with {{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > 
py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at >
[jira] [Assigned] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16740: Assignee: (was: Apache Spark) > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. > {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran with {{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > 
py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at >
[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394624#comment-15394624 ] Apache Spark commented on SPARK-16740: -- User 'sylvinus' has created a pull request for this issue: https://github.com/apache/spark/pull/14373 > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. > {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran 
with {{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77) > at >
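A {{NegativeArraySizeException}} of this kind usually means an array length was computed with arithmetic that overflowed a signed 32-bit int. As an illustration only — this is not Spark's actual sizing code in {{LongToUnsafeRowMap}}, and the {{to_int32}} helper is ours — truncating the span of the long keys from the repro above to a 32-bit int goes negative:

```python
def to_int32(x):
    """Simulate a Scala/Java narrowing conversion from Long to Int."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

# Key values taken from data1 in the repro script above.
min_key = -5543241376386463808
max_key = 4661454128115150227

# A candidate dense-array length derived from the key span (~1.02e19,
# far beyond Int.MaxValue = 2147483647) wraps negative when narrowed.
span = max_key - min_key + 1
print(to_int32(span) < 0)  # True: allocating an array of this size throws
```

Whether the actual overflow in SPARK-16740 happens in range sizing or capacity growth is determined by the fix in pull request 14373; the sketch only shows the wraparound mechanism.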
[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394622#comment-15394622 ] Min Wei commented on SPARK-16725: - I would love to see Spark successful as a platform, and not caught up in some versioning mess down the road. >Spark shades Guava and therefore doesn't leak it. This does not seem to be true. In my case, I only used spark-shell and spark-cassandra in the standalone environment, no Hadoop bits (at least not explicitly). >Spark depends on Hadoop, Hadoop depends on unshaded Guava. It does not seem right that Spark has to provide a jar for Hadoop, the lower level. >shield yourself by shading is pretty good For the context, "I" as the user, am not at "fault" here. I am using Spark, and one dependency component Spark-cassandra. It does not seem right that "I" as the consumer have to do any shading. On a separate note, looks like the Guava versioning specifically is a bit random. Maybe if Spark/Hadoop could not get rid of the dependency, one option is to copy the code with a different namespace, i.e. a more explicit shading. My two cents. > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+ > diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala > b/core/src/main/scala/org/apache/spark/SecurityManager.scala > index f72c7de..abddafe 100644 > --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala > +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala > @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom} > import java.security.cert.X509Certificate > import javax.net.ssl._ > > -import com.google.common.hash.HashCodes > +import com.google.common.hash.HashCode > import com.google.common.io.Files > import org.apache.hadoop.io.Text > > @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf) > val secret = new Array[Byte](length) > rnd.nextBytes(secret) > > -val cookie = HashCodes.fromBytes(secret).toString() > +val cookie = HashCode.fromBytes(secret).toString() > SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, > cookie) > cookie >} else { > diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala > b/core/src/main/scala/org/apache/spark/SparkEnv.scala > index af50a6d..02545ae 100644 > --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala > +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala > @@ -72,7 +72,7 @@ class SparkEnv ( > >// A general, soft-reference map for metadata needed during HadoopRDD > split computation >// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats). 
> - private[spark] val hadoopJobMetadata = new > MapMaker().softValues().makeMap[String, Any]() > + private[spark] val hadoopJobMetadata = new > MapMaker().weakValues().makeMap[String, Any]() > >private[spark] var driverTmpDir: Option[String] = None > > diff --git a/pom.xml b/pom.xml > index d064cb5..7c3e036 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -368,8 +368,7 @@ > > com.google.guava > guava > -14.0.1 > -provided > +19.0 > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394587#comment-15394587 ] Sean Owen commented on SPARK-16725: --- Thanks for the reminder [~vanzin], I caught up again on the current state here. There's a long story here, but, no, Spark shades Guava and therefore doesn't leak it. However it used to have to 'leak' the Optional class in Guava before 2.x. Spark depends on Hadoop, Hadoop depends on unshaded Guava. Any assembly containing Spark needs Hadoop and some unshaded Guava, therefore. That's what the Guava 14 is about IIUC. I suspect that in principle Spark could both include Guava 14 for other dependencies and use shaded Guava 19 internally, but that could be tricky. I suppose there just hasn't been a compelling reason. Guava isn't backwards compatible for more than a few versions, actually. Fair enough, they're all major releases. But moving the dependency forward in general does break things. The bad news is that, whatever Spark does, you still face leakage from Hadoop or other projects. Hence the advice to shield yourself by shading is pretty good, downsides notwithstanding, because it's a problem that crops up in more than just Spark. > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+ > diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala > b/core/src/main/scala/org/apache/spark/SecurityManager.scala > index f72c7de..abddafe 100644 > --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala > +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala > @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom} > import java.security.cert.X509Certificate > import javax.net.ssl._ > > -import com.google.common.hash.HashCodes > +import com.google.common.hash.HashCode > import com.google.common.io.Files > import org.apache.hadoop.io.Text > > @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf) > val secret = new Array[Byte](length) > rnd.nextBytes(secret) > > -val cookie = HashCodes.fromBytes(secret).toString() > +val cookie = HashCode.fromBytes(secret).toString() > SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, > cookie) > cookie >} else { > diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala > b/core/src/main/scala/org/apache/spark/SparkEnv.scala > index af50a6d..02545ae 100644 > --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala > +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala > @@ -72,7 +72,7 @@ class SparkEnv ( > >// A general, soft-reference map for metadata needed during HadoopRDD > split computation >// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats). 
> - private[spark] val hadoopJobMetadata = new > MapMaker().softValues().makeMap[String, Any]() > + private[spark] val hadoopJobMetadata = new > MapMaker().weakValues().makeMap[String, Any]() > >private[spark] var driverTmpDir: Option[String] = None > > diff --git a/pom.xml b/pom.xml > index d064cb5..7c3e036 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -368,8 +368,7 @@ > > com.google.guava > guava > -14.0.1 > -provided > +19.0 > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394572#comment-15394572 ] Min Wei commented on SPARK-16725: - First, I am not blocked; given the OSS nature I can hack whatever I need to make things work. I am purely curious how Spark plans to manage versioning as a more general platform. As it stands, it looks like Spark is "leaking" the Guava dependency. A quick web search shows quite a bit of energy has been spent on this: https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/HnTsWJkI5jo https://issues.apache.org/jira/browse/ZEPPELIN-620 My suggestion is that the Spark platform provide guidelines, so that spark-cassandra and other platform pieces built on top can follow them. Otherwise it will be painful for upper-stack developers and users to consume the whole stack. >Spark has to ship a Guava jar because Hadoop needs it I don't understand this. I assume Hadoop is a dependency for Spark. Does Spark use v14 of Guava to shadow v11 in Hadoop? >Changing from 14 to 16 will fix your use case, but what about someone who wants >a different version? As long as the version is moving forward, not backwards. Of course in this case Guava itself could have done a better job of backwards compatibility. >"shade your custom dependencies" works for everyone, Won't this cause code/jar bloat and pain for everyone? > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+
[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394555#comment-15394555 ] Dongjoon Hyun commented on SPARK-16740: --- Maybe you could modify the following line? {code} val range = maxKey - minKey {code} > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is a {{left_outer}} example, but it also crashes with a regular > {{inner}} join. > {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When run with 
{{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. > : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) >
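The Long keys in the repro above make the suspected overflow in {{val range = maxKey - minKey}} concrete: the true difference between the two keys does not fit in a signed 64-bit long, so in Java it wraps to a negative value, and using that value as an array size raises {{NegativeArraySizeException}}. A small Python sketch (not Spark code; Python ints are unbounded, so a helper simulates Java's two's-complement long arithmetic):

```python
# Illustration of the suspected 64-bit overflow, using the Long keys from
# the repro above. The helper name is illustrative.
def to_int64(x):
    """Wrap an unbounded int into a signed 64-bit value, like a Java long."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

max_key = 4661454128115150227
min_key = -5543241376386463808

# The true difference exceeds Long.MaxValue (9223372036854775807) ...
assert max_key - min_key > 2**63 - 1

# ... so Java's long subtraction wraps it to a negative number, which then
# blows up when used to size the LongToUnsafeRowMap backing array.
range_64 = to_int64(max_key - min_key)
assert range_64 < 0
```

Any fix along the lines Dongjoon suggests would need to detect this overflow (or compute the range in wider arithmetic) before allocating.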
[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394553#comment-15394553 ] Dongjoon Hyun commented on SPARK-16740: --- Hi, [~sylvinus]. Looks like it. Could you open a PR for this issue? > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[jira] [Created] (SPARK-16743) converter and access code out of sync: createDataFrame on RDD[Option[C]] fails with MatchError
Daniel Barclay created SPARK-16743: -- Summary: converter and access code out of sync: createDataFrame on RDD[Option[C]] fails with MatchError Key: SPARK-16743 URL: https://issues.apache.org/jira/browse/SPARK-16743 Project: Spark Issue Type: Bug Affects Versions: 1.6.2, 1.6.1 Reporter: Daniel Barclay Calling {{SqlContext}}'s {{createDataFrame}} on an RDD of type {{RDD\[Option\[SomeUserClass]]}} leads to an internal error. For example, if the first field of {{SomeUserClass}} is of type {{String}}, evaluating the RDD yields a {{MatchError}} referring to an instance of {{SomeUserClass}} in {{org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl}} (which should have been passed only certain kinds of representations of strings). The problem seems to be in {{ExistingRDD.scala}}'s {{RDDConversions.productToRowRdd(...)}}: it builds a list of converters that reflects the list of members of {{SomeUserClass}} (looking past the {{Option}} part of the RDD record type {{Option\[SomeUserClass]}}). However, the data-access code ({{r.productElement\(i)}}) does not look past the {{Option}} part correspondingly (it does not also traverse from the {{Some}} instance to the {{SomeUserClass}}). Therefore, it ends up passing the instance of {{SomeUserClass}} to the converter intended for the first member field of {{SomeUserClass}} (e.g., a String converter), yielding an internal error. (If {{RDD\[Option\[...]]}} doesn't make sense in the first place, it should be rejected with a "conscious" error rather than failing with an internal inconsistency.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
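The converter/accessor mismatch described above can be illustrated without Spark. The sketch below is an assumed analogy, not the actual Catalyst code: a "converter" built for the first {{String}} field of {{SomeUserClass}} is handed the wrapper object itself, which is exactly the shape of failure a {{MatchError}} in {{StringConverter.toCatalystImpl}} suggests.

```python
# Spark-free analogy (illustrative names, not Catalyst internals) of passing
# a whole record where its first field was expected.
class SomeUserClass:
    def __init__(self, name):
        self.name = name  # first field, a string

def string_converter(value):
    # Stands in for StringConverter.toCatalystImpl's pattern match, which
    # only accepts string-like representations.
    if not isinstance(value, str):
        raise TypeError("MatchError-like failure on " + type(value).__name__)
    return value

record = SomeUserClass("ok")  # think: the payload inside Some(...)

# Correct pairing: the converter receives the field it was built for.
converted = string_converter(record.name)

# Buggy pairing: the accessor fails to descend into the record, so the
# converter for field 0 receives the whole SomeUserClass instance.
try:
    string_converter(record)
    crashed = False
except TypeError:
    crashed = True
```

The fix direction implied by the report is to make the access side unwrap to the same depth as the converter-building side, or to reject {{RDD\[Option\[...]]}} explicitly up front.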
[jira] [Created] (SPARK-16742) Kerberos support for Spark on Mesos
Michael Gummelt created SPARK-16742: --- Summary: Kerberos support for Spark on Mesos Key: SPARK-16742 URL: https://issues.apache.org/jira/browse/SPARK-16742 Project: Spark Issue Type: New Feature Components: Mesos Reporter: Michael Gummelt We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()
[ https://issues.apache.org/jira/browse/SPARK-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394515#comment-15394515 ] Zoltan Fedor commented on SPARK-16741: -- Most databases allow you to set unique keys and disallow the insert of duplicate values on those keys, but unfortunately you usually need special SQL in the INSERT (on Spark's side) to silently ignore those errors. Even if that worked, there would be certain cases where you are simply not able to create a unique key (the data itself is not unique). Hence I think the best option is to create an attribute in jdbc.write() which would allow the user to turn off the effect of speculative mode only for that jdbc.write(). Basically that would allow the user to decide whether to override the global speculative setting for this particular operation and run without speculative mode and with no duplicates. > spark.speculation causes duplicate rows in df.write.jdbc() > -- > > Key: SPARK-16741 > URL: https://issues.apache.org/jira/browse/SPARK-16741 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.2 > Environment: PySpark 1.6.2, Oracle Linux 6.5, Oracle 11.2 >Reporter: Zoltan Fedor >Priority: Minor > > Since a fix added in Spark 1.6.2 we can write string data back into an Oracle > database, so I went to try it out and found that rows showed up duplicated in > the database table after they got inserted into our Oracle database. > The code we use is very simple: > df = sqlContext.sql("SELECT * FROM example_temp_table") > df.write.jdbc("jdbc:oracle:thin:"+connection_script, "target_table") > The data in the 'target_table' in the database has twice as many rows as the > 'df' dataframe in SparkSQL. > After some investigation it turns out that this is caused by our > spark.speculation setting being set to True. > As soon as we turned this off, there were no more duplicates generated. 
> This somewhat makes sense - spark.speculation causes the map jobs to run 2 > copies - resulting in every row being inserted into our Oracle databases > twice. > Probably the df.jdbc.write() method does not consider a Spark context running > in speculative mode, hence the inserts coming from the speculative map also > get inserted - causing every record to be inserted twice. > This bug is likely independent of the database type (we use Oracle) > and whether PySpark is used or Scala or Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394508#comment-15394508 ] Shea Parkes commented on SPARK-12261: - I still can't get this bug to reproduce reliably locally, but I can confirm that my proposed bandaid of exhausting the iterator at the end of {{takeUpToNumLeft()}} made the error go away entirely in manual testing. Would the Spark maintainers be willing to accept such a band-aid into the main codebase, or is this something I'd just need to maintain in our own Spark distributions? (I already maintain a couple other modifications related to Windows mess.) A more root cause fix would require more Scala knowledge than I have. I'd likely propose to define a new constant (e.g. {{SpecialLengths.PYTHON_WORKER_EXITING}}) and put logic into Scala that would make it immediately stop sending new data over the socket... > pyspark crash for large dataset > --- > > Key: SPARK-12261 > URL: https://issues.apache.org/jira/browse/SPARK-12261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: windows >Reporter: zihao > > I tried to import a local text(over 100mb) file via textFile in pyspark, when > i ran data.take(), it failed and gave error messages including: > 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "E:/spark_python/test3.py", line 9, in > lines.take(5) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, > in take > res = self.context.runJob(self, takeUpToNumLeft, p) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line > 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in > __call__ > answer, self.gateway_client, self.target_id, self.name) > File 
"D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line > 36, in deco > return f(*a, **kw) > File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): java.net.SocketException: Connection reset by peer: > socket write error > Then i ran the same code for a small text file, this time .take() worked fine. > How can i solve this problem? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
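The band-aid Shea describes is easy to sketch in isolation. The function below is an illustrative stand-in for PySpark's internal {{takeUpToNumLeft()}}, not the actual implementation: instead of returning as soon as enough rows are taken, it drains the rest of the iterator, so the JVM side is never left blocked mid-write on a half-read socket (the suspected cause of the connection-reset error).

```python
# Illustrative sketch of the proposed band-aid (names are made up, not the
# real pyspark/rdd.py code): take the first `num_left` items, then keep
# consuming and discarding the remainder to exhaust the input stream.
def take_up_to_num_left(iterator, num_left):
    taken = []
    for item in iterator:
        if num_left > 0:
            taken.append(item)
            num_left -= 1
        # else: keep draining the iterator instead of returning early

    return taken

source = iter(range(10))
first_three = take_up_to_num_left(source, 3)
```

The cost of this approach is that the worker reads (and throws away) data it does not need, which is why a root-cause fix on the Scala side, such as the {{SpecialLengths.PYTHON_WORKER_EXITING}} signal Shea sketches, would be preferable.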
[jira] [Updated] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()
[ https://issues.apache.org/jira/browse/SPARK-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16741: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) Yeah, the problem is that it's a global setting. I wonder if it's possible to make your operation idempotent by failing if both inserts happen -- like by having some unique key constraint that the second insert would violate. > spark.speculation causes duplicate rows in df.write.jdbc() > -- 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
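Sean's unique-key idea can be sketched outside Spark. The snippet below uses SQLite purely for illustration (the table and column names are made up); the point is only the shape of the idea: with a primary-key constraint, a speculative task's duplicate INSERT fails instead of silently doubling the rows.

```python
# Sketch of an idempotent write under speculative duplication, assuming the
# data has (or can be given) a unique key. Not Spark code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_table (id INTEGER PRIMARY KEY, payload TEXT)")

def insert_once(conn, row_id, payload):
    """Insert a row; report False if a duplicate key shows it already ran."""
    try:
        conn.execute("INSERT INTO target_table VALUES (?, ?)", (row_id, payload))
        return True
    except sqlite3.IntegrityError:
        return False  # the speculative duplicate is rejected by the constraint

first = insert_once(conn, 1, "row")    # original task's write
second = insert_once(conn, 1, "row")   # speculative copy of the same write
row_count = conn.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]
```

As Zoltan notes in the discussion, this only works when the data actually has a unique key, which is why the alternative of disabling speculation for a single write is also on the table.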
[jira] [Commented] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()
[ https://issues.apache.org/jira/browse/SPARK-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394501#comment-15394501 ] Zoltan Fedor commented on SPARK-16741: -- Thanks Sean. I agree, it is debatable whether this is a bug or not. My thinking is that since I don't see a scenario where jdbc.write() would need to run in speculative mode, it should automatically turn off speculative mode when you call jdbc.write(). Whether jdbc.write() not turning speculative mode off automatically is a bug or a feature is debatable. I am okay with changing this to a feature request. > spark.speculation causes duplicate rows in df.write.jdbc() > -- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()
[ https://issues.apache.org/jira/browse/SPARK-16741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394480#comment-15394480 ] Sean Owen commented on SPARK-16741: --- Yeah, because the output isn't idempotent, I'm not sure you can use speculation here. Although the insert occurs in a transaction, it's possible for it to succeed in both tasks before one can be cancelled. I'm not sure it's therefore a bug. > spark.speculation causes duplicate rows in df.write.jdbc() > -- 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions
[ https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394485#comment-15394485 ] Alok Singh commented on SPARK-16611: Hi [~shivaram] Sorry for late reply due to vacation. Here is the detail answers a) lapply or map lapplyPartition or mapPartition We have current pre-processing codes (i.e recode, na impute, outlier etc.) implemented that is different that what spark currently provides (but currently sparkR doesn't have it) and hence we uses the apply on dataframe lapplyPartition was added in the list since we don’t want to re-shuffle data for efficiency and would prefer for certain cases lappyPartition instead of lapply b) flatMap: not the high priority we can live without it :) c) RDD:toRDD, getJRDD (i.e RDD api) Since SystemML uses and works internally in the binary block matrix format and it’s api is based on the JavaRDD apis and out R ext takes sparkR data frame and extract the RDD and pass it to the systemML java api. Note that the use of RDD is discouraged from spark as using data frame enables one to use all the catalyst optimizer features. However, in our case we are sure that we will just get the RDD and convert to the binary block matrix so that systemML can consumes and do the heavy lifting. d) cleanup.jobj: SystemML uses the MLContext and matrixCharacteristic class that is instantiated in JVM and whose object ref is kept alive in the sparkR and later when systemML has done it’s computation. we cleanup the objects. The way we achieve it using the References classes in R and use it’s finalize method to register the cleanup.jobj once we have created the jobj via newJObject(“sysml.class”) In general, I think goal our DataFrame only api is great but removing RDD apis 100% would have many issues with out ext package on the top of sparkR. Can we continue to keep them private (if we can't converse to the decision) for now? 
Regarding dapply, one concern we have is that dapplyInternal always broadcasts the variables set via useBroadcast. However, in many cases lapply is what one needs, since the user knows for sure that they will not be using the broadcast vars. Also, lapply fits naturally into R syntax. Thanks, Alok > Expose several hidden DataFrame/RDD functions > - > > Key: SPARK-16611 > URL: https://issues.apache.org/jira/browse/SPARK-16611 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Oscar D. Lara Yejas > > Expose the following functions: > - lapply or map > - lapplyPartition or mapPartition > - flatMap > - RDD > - toRDD > - getJRDD > - cleanup.jobj > cc: > [~javierluraschi] [~j...@rstudio.com] [~shivaram] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()
Zoltan Fedor created SPARK-16741: Summary: spark.speculation causes duplicate rows in df.write.jdbc() Key: SPARK-16741 URL: https://issues.apache.org/jira/browse/SPARK-16741 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.2 Environment: PySpark 1.6.2, Oracle Linux 6.5, Oracle 11.2 Reporter: Zoltan Fedor Since a fix added in Spark 1.6.2 we can write string data back into an Oracle database, so I went to try it out and found that rows showed up duplicated in the database table after they were inserted into our Oracle database. The code we use is very simple: df = sqlContext.sql("SELECT * FROM example_temp_table") df.write.jdbc("jdbc:oracle:thin:"+connection_script, "target_table") The data in the 'target_table' in the database has twice as many rows as the 'df' dataframe in SparkSQL. After some investigation it turned out that this is caused by our spark.speculation setting being set to True. As soon as we turned this off, there were no more duplicates generated. This somewhat makes sense - spark.speculation causes the map tasks to run in 2 copies - resulting in every row being inserted into our Oracle database twice. Probably the df.write.jdbc() method does not account for a Spark context running in speculative mode, so the inserts coming from the speculative map task also get committed - causing every record to be inserted twice. This bug is likely independent of the database type (we use Oracle) and of whether PySpark, Scala, or Java is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
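The mechanism described in the report can be illustrated without Spark or a database. A hypothetical plain-Python simulation (the helper names are made up for illustration) of two task attempts writing the same partition, first with an append-style INSERT and then with a primary-key upsert, shows why the speculative copy duplicates rows in one case but not the other:

```python
# Hypothetical simulation (plain Python, no Spark involved): why two
# speculative copies of the same task duplicate rows under a plain
# append-style INSERT, while a key-based upsert stays idempotent.

def append_insert(partition_rows, attempts=2):
    """Model INSERT: every task attempt appends its rows to the table."""
    table = []
    for _ in range(attempts):          # original task + speculative copy
        table.extend(partition_rows)
    return table

def keyed_upsert(partition_rows, attempts=2):
    """Model an idempotent write: rows are keyed by primary key, so a
    duplicate attempt overwrites instead of duplicating."""
    table = {}
    for _ in range(attempts):
        for key, value in partition_rows:
            table[key] = value
    return table

rows = [(1, "a"), (2, "b")]
print(len(append_insert(rows)))   # 4 -- every row lands twice
print(len(keyed_upsert(rows)))    # 2 -- the duplicate attempt collapses
```

This is only a sketch of the failure mode: JDBC writes from Spark are plain batched INSERTs, so any at-least-once task execution (speculation, but also task retries after failures) can duplicate rows unless the target table enforces a key or speculation is disabled for the write.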
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394450#comment-15394450 ] Eliano Marques commented on SPARK-8518: --- Would it be possible to add some information about the coefficients' standard errors? Following the information at https://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression, would it be possible to expose in the model outputs the second derivative of the gradient function for beta and log(theta)? Bringing this information into the model outputs would enable us to perform the standard statistical tests on the parameters, similar to what is done in GLM. Thanks > Log-linear models for survival analysis > --- > > Key: SPARK-8518 > URL: https://issues.apache.org/jira/browse/SPARK-8518 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang >Priority: Critical > Fix For: 1.6.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > We want to add basic log-linear models for survival analysis. The > implementation should match the result from R's survival package > (http://cran.r-project.org/web/packages/survival/index.html). > Design doc from [~yanboliang]: > https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
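For reference, the quantities requested in the comment follow a standard maximum-likelihood formulation (this is not currently part of Spark's AFT model output): the standard errors come from the inverse of the observed information, i.e. the Hessian of the negative log-likelihood evaluated at the fitted parameters $\hat\theta = (\hat\beta, \widehat{\log\sigma})$:

```latex
% Observed information at the MLE (Hessian of the negative log-likelihood):
\mathcal{I}(\hat\theta) \;=\; -\left.\frac{\partial^2 \ell(\theta)}
  {\partial\theta\,\partial\theta^{\top}}\right|_{\theta=\hat\theta}
% Standard error of each parameter, from the diagonal of its inverse:
\qquad
\widehat{\mathrm{SE}}(\hat\theta_j) \;=\;
  \sqrt{\bigl[\mathcal{I}(\hat\theta)^{-1}\bigr]_{jj}}
% Wald statistic for the usual per-parameter tests, as in GLM summaries:
\qquad
z_j \;=\; \frac{\hat\theta_j}{\widehat{\mathrm{SE}}(\hat\theta_j)}
```

Exposing $\mathcal{I}(\hat\theta)$ (or its inverse) in the model summary would be enough to compute the Wald tests the comment asks for.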
[jira] [Comment Edited] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394437#comment-15394437 ] Sylvain Zimmer edited comment on SPARK-16740 at 7/26/16 7:54 PM: - I'm not a Scala expert but from a quick review of the code, it appears that it's easy to overflow the {{range}} variable in {{optimize()}}: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L608 In my example, that would yield {code} scala> val range = 4661454128115150227L - (-5543241376386463808L) range: Long = -8242048569207937581 {code} Maybe we should add {{range >= 0}} as a condition of doing that optimization? was (Author: sylvinus): I'm not a Scala expert but from a quick review of the code, it appears that it's easy to overflow the {{range}} variable in {{optimize()}}: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L608 In my example, that would yield {{ scala> val range = 4661454128115150227L - (-5543241376386463808L) range: Long = -8242048569207937581 }} Maybe we should add {{range >= 0}} as a condition of doing that optimization? > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. 
> {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran with {{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. 
> : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at >
[jira] [Commented] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-16740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394437#comment-15394437 ] Sylvain Zimmer commented on SPARK-16740: I'm not a Scala expert but from a quick review of the code, it appears that it's easy to overflow the {{range}} variable in {{optimize()}}: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L608 In my example, that would yield {{ scala> val range = 4661454128115150227L - (-5543241376386463808L) range: Long = -8242048569207937581 }} Maybe we should add {{range >= 0}} as a condition of doing that optimization? > joins.LongToUnsafeRowMap crashes with NegativeArraySizeException > > > Key: SPARK-16740 > URL: https://issues.apache.org/jira/browse/SPARK-16740 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer > > Hello, > Here is a crash in Spark SQL joins, with a minimal reproducible test case. > Interestingly, it only seems to happen when reading Parquet data (I added a > {{crash = True}} variable to show it) > This is an {{left_outer}} example, but it also crashes with a regular > {{inner}} join. 
> {code} > import os > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > schema1 = SparkTypes.StructType([ > SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) > ]) > schema2 = SparkTypes.StructType([ > SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) > ]) > # Valid Long values (-9223372036854775808 < -5543241376386463808 , > 4661454128115150227 < 9223372036854775807) > data1 = [(4661454128115150227,), (-5543241376386463808,)] > data2 = [(650460285, )] > df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) > df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) > crash = True > if crash: > os.system("rm -rf /tmp/sparkbug") > df1.write.parquet("/tmp/sparkbug/vertex") > df2.write.parquet("/tmp/sparkbug/edge") > df1 = sqlc.read.load("/tmp/sparkbug/vertex") > df2 = sqlc.read.load("/tmp/sparkbug/edge") > result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") > # Should print [Row(id2=650460285, id1=None)] > print result_df.collect() > {code} > When ran with {{spark-submit}}, the final {{collect()}} call crashes with > this: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o61.collectToPython. 
> : org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) > at > org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at >
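The overflow described in the comments can be reproduced outside Spark. A minimal plain-Python sketch (the `jvm_long_sub` helper is hypothetical, not Spark's actual code) that mimics the JVM's 64-bit wrap-around subtraction yields exactly the negative range value quoted above, and shows the `range >= 0` guard that would catch it:

```python
# Hypothetical illustration of the overflow in LongToUnsafeRowMap.optimize():
# simulate Java's 64-bit signed subtraction, which silently wraps around.

def jvm_long_sub(a, b):
    """Subtract two longs with Java's 64-bit wrap-around semantics."""
    r = (a - b) & 0xFFFFFFFFFFFFFFFF               # keep the low 64 bits
    return r - (1 << 64) if r >= (1 << 63) else r  # reinterpret as signed

# The two join keys from the reproduction case above.
max_key = 4661454128115150227
min_key = -5543241376386463808

rng = jvm_long_sub(max_key, min_key)
print(rng)       # -8242048569207937581, the overflowed value in the comment

# The suggested guard: only take the dense-array fast path when the
# key range did not overflow (a true range always satisfies range >= 0).
print(rng >= 0)  # False -- the optimization should be skipped
```

With such a guard, keys spanning more than half the Long range would simply fall back to the generic hash map instead of trying to allocate a negatively-sized dense array.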
[jira] [Created] (SPARK-16740) joins.LongToUnsafeRowMap crashes with NegativeArraySizeException
Sylvain Zimmer created SPARK-16740: -- Summary: joins.LongToUnsafeRowMap crashes with NegativeArraySizeException Key: SPARK-16740 URL: https://issues.apache.org/jira/browse/SPARK-16740 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, SQL Affects Versions: 2.0.0 Reporter: Sylvain Zimmer Hello, Here is a crash in Spark SQL joins, with a minimal reproducible test case. Interestingly, it only seems to happen when reading Parquet data (I added a {{crash = True}} variable to show it) This is an {{left_outer}} example, but it also crashes with a regular {{inner}} join. {code} import os from pyspark import SparkContext from pyspark.sql import types as SparkTypes from pyspark.sql import SQLContext sc = SparkContext() sqlc = SQLContext(sc) schema1 = SparkTypes.StructType([ SparkTypes.StructField("id1", SparkTypes.LongType(), nullable=True) ]) schema2 = SparkTypes.StructType([ SparkTypes.StructField("id2", SparkTypes.LongType(), nullable=True) ]) # Valid Long values (-9223372036854775808 < -5543241376386463808 , 4661454128115150227 < 9223372036854775807) data1 = [(4661454128115150227,), (-5543241376386463808,)] data2 = [(650460285, )] df1 = sqlc.createDataFrame(sc.parallelize(data1), schema1) df2 = sqlc.createDataFrame(sc.parallelize(data2), schema2) crash = True if crash: os.system("rm -rf /tmp/sparkbug") df1.write.parquet("/tmp/sparkbug/vertex") df2.write.parquet("/tmp/sparkbug/edge") df1 = sqlc.read.load("/tmp/sparkbug/vertex") df2 = sqlc.read.load("/tmp/sparkbug/edge") result_df = df2.join(df1, on=(df1.id1 == df2.id2), how="left_outer") # Should print [Row(id2=650460285, id1=None)] print result_df.collect() {code} When ran with {{spark-submit}}, the final {{collect()}} call crashes with this: {code} py4j.protocol.Py4JJavaError: An error occurred while calling o61.collectToPython. 
: org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) at org.apache.spark.sql.execution.BatchedDataSourceScanExec.consume(ExistingRDD.scala:225) at org.apache.spark.sql.execution.BatchedDataSourceScanExec.doProduce(ExistingRDD.scala:328) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) at org.apache.spark.sql.execution.BatchedDataSourceScanExec.produce(ExistingRDD.scala:225) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at
[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394329#comment-15394329 ] Marcelo Vanzin commented on SPARK-16725: Again, if your code depends on a specific version of Guava, you should shade Guava instead of forcing the infrastructure to upgrade. Spark has to ship a Guava jar because Hadoop needs it - it sucks and the Hadoop guys want to fix that in version 3, but we're stuck with that for now. Changing from 14 to 16 will fix *your* use case, but what about someone who wants a different version? Saying "shade your custom dependencies" works for everyone, even though it might not be the optimal answer. > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. 
> It would be great to update the Guava dependency to version 16+ > diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala > b/core/src/main/scala/org/apache/spark/SecurityManager.scala > index f72c7de..abddafe 100644 > --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala > +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala > @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom} > import java.security.cert.X509Certificate > import javax.net.ssl._ > > -import com.google.common.hash.HashCodes > +import com.google.common.hash.HashCode > import com.google.common.io.Files > import org.apache.hadoop.io.Text > > @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf) > val secret = new Array[Byte](length) > rnd.nextBytes(secret) > > -val cookie = HashCodes.fromBytes(secret).toString() > +val cookie = HashCode.fromBytes(secret).toString() > SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, > cookie) > cookie >} else { > diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala > b/core/src/main/scala/org/apache/spark/SparkEnv.scala > index af50a6d..02545ae 100644 > --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala > +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala > @@ -72,7 +72,7 @@ class SparkEnv ( > >// A general, soft-reference map for metadata needed during HadoopRDD > split computation >// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats). 
> - private[spark] val hadoopJobMetadata = new > MapMaker().softValues().makeMap[String, Any]() > + private[spark] val hadoopJobMetadata = new > MapMaker().weakValues().makeMap[String, Any]() > >private[spark] var driverTmpDir: Option[String] = None > > diff --git a/pom.xml b/pom.xml > index d064cb5..7c3e036 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -368,8 +368,7 @@ > > com.google.guava > guava > -14.0.1 > -provided > +19.0 > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
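As a concrete illustration of the shading advice in the comment, here is a minimal maven-shade-plugin sketch (the `myapp.shaded` prefix is a placeholder, and the exact plugin version/placement depends on your build, not Spark's): it relocates Guava's packages inside your application or connector jar, so your bundled Guava 16+ cannot clash with the Guava 14 that Spark and Hadoop put on the classpath.

```xml
<!-- Sketch only: relocate Guava inside the application/connector jar.
     Adjust coordinates and the relocation prefix for your project. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>myapp.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

After shading, the bytecode in your jar references `myapp.shaded.com.google.common.*`, so both Guava versions can coexist at runtime without either side changing its dependency.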
[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394311#comment-15394311 ] Min Wei commented on SPARK-16725: - I am not sure if the Hadoop jar is the issue, as I am just using the spark shell for some local testing. I "fixed" my test after upgrading the Guava in the Spark core jar per the diff file. Here is a way to repro it after building the spark-cassandra connector. ./bin/spark-shell --jars ./spark-cassandra-connector-assembly-1.6.0-27-g5760745.jar The following exception goes away after the "upgrade". I agree that there would be other cases where the hadoop jar could cause problems. java.lang.IllegalStateException: Detected Guava issue #1635 which indicates that a version of Guava less than 16.01 is in use. This introduces codec resolution issues and potentially other incompatibility issues in the driver. Please upgrade to Guava 16.01 or later. at com.datastax.driver.core.SanityChecks.checkGuava(SanityChecks.java:62) at com.datastax.driver.core.SanityChecks.check(SanityChecks.java:36) at com.datastax.driver.core.Cluster.(Cluster.java:68) at com.datastax.spark.connector.cql.DefaultConnectionFactory$.clusterBuilder(CassandraConnectionFactory.scala:37) at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createCluster(CassandraConnectionFactory.scala:98) at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:163) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:157) at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:157) at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:34) at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:60) at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:85) 
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:114) at com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:127) at com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:346) at com.datastax.spark.connector.cql.Schema$.tableFromCassandra(Schema.scala:366) at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.tableDef(CassandraTableRowReaderProvider.scala:52) at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef$lzycompute(CassandraTableScanRDD.scala:60) at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tableDef(CassandraTableScanRDD.scala:60) at com.datastax.spark.connector.rdd.CassandraTableRowReaderProvider$class.verify(CassandraTableRowReaderProvider.scala:140) at com.datastax.spark.connector.rdd.CassandraTableScanRDD.verify(CassandraTableScanRDD.scala:60) at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:246) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1280) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:358) at org.apache.spark.rdd.RDD.take(RDD.scala:1275) at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:132) at com.datastax.spark.connector.rdd.CassandraRDD.take(CassandraRDD.scala:133) at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1315) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at 
org.apache.spark.rdd.RDD.withScope(RDD.scala:358) at org.apache.spark.rdd.RDD.first(RDD.scala:1314) ... 52 elided > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. > It would be great to update the Guava dependency to version 16+ > diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala > b/core/src/main/scala/org/apache/spark/SecurityManager.scala >
[jira] [Updated] (SPARK-16718) gbm-style treeboost
[ https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16718: -- Description: .As an initial minimal change, we should provide TreeBoost as implemented in GBM for L1, L2, and logistic losses: by introducing a new "loss-based" impurity, tree leaves in GBTs can have loss-optimal predictions for their partition of the data. Commit should have evidence of accuracy improvement was: As an initial minimal change, we should provide TreeBoost as implemented in GBM for L1, L2, and logistic losses: by introducing a new "loss-based" impurity, tree leaves in GBTs can have loss-optimal predictions for their partition of the data. Commit should have evidence of accuracy improvement > gbm-style treeboost > --- > > Key: SPARK-16718 > URL: https://issues.apache.org/jira/browse/SPARK-16718 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Vladimir Feinberg > > .As an initial minimal change, we should provide TreeBoost as implemented in > GBM for L1, L2, and logistic losses: by introducing a new "loss-based" > impurity, tree leaves in GBTs can have loss-optimal predictions for their > partition of the data. > Commit should have evidence of accuracy improvement -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-15703. --- Resolution: Fixed Fix Version/s: 2.1.0 > Make ListenerBus event queue size configurable > -- > > Key: SPARK-15703 > URL: https://issues.apache.org/jira/browse/SPARK-15703 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.1.0 > > Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot > 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, > spark-dynamic-executor-allocation.png > > > The Spark UI doesn't seem to be showing all the tasks and metrics. > I ran a job with 10 tasks but Detail stage page says it completed 93029: > Summary Metrics for 93029 Completed Tasks > The Stages for all jobs pages list that only 89519/10 tasks finished but > its completed. The metrics for shuffled write and input are also incorrect. > I will attach screen shots. > I checked the logs and it does show that all the tasks actually finished. > 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 > (TID 54038) in 265309 ms on 10.213.45.51 (10/10) > 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks > have all completed, from pool -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-15703: -- Priority: Minor (was: Critical) > Make ListenerBus event queue size configurable > -- > > Key: SPARK-15703 > URL: https://issues.apache.org/jira/browse/SPARK-15703 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Priority: Minor > Fix For: 2.1.0 > > Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot > 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, > spark-dynamic-executor-allocation.png > > > The Spark UI doesn't seem to be showing all the tasks and metrics. > I ran a job with 10 tasks but Detail stage page says it completed 93029: > Summary Metrics for 93029 Completed Tasks > The Stages for all jobs pages list that only 89519/10 tasks finished but > its completed. The metrics for shuffled write and input are also incorrect. > I will attach screen shots. > I checked the logs and it does show that all the tasks actually finished. > 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 > (TID 54038) in 265309 ms on 10.213.45.51 (10/10) > 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks > have all completed, from pool -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-15703: -- Issue Type: Improvement (was: Bug) > Make ListenerBus event queue size configurable > -- > > Key: SPARK-15703 > URL: https://issues.apache.org/jira/browse/SPARK-15703 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Priority: Critical > Fix For: 2.1.0 > > Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot > 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, > spark-dynamic-executor-allocation.png > > > The Spark UI doesn't seem to be showing all the tasks and metrics. > I ran a job with 10 tasks but Detail stage page says it completed 93029: > Summary Metrics for 93029 Completed Tasks > The Stages for all jobs pages list that only 89519/10 tasks finished but > its completed. The metrics for shuffled write and input are also incorrect. > I will attach screen shots. > I checked the logs and it does show that all the tasks actually finished. > 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 > (TID 54038) in 265309 ms on 10.213.45.51 (10/10) > 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks > have all completed, from pool -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394271#comment-15394271 ] Thomas Graves commented on SPARK-15703: --- Using this jira to make the ListenerBus event queue size configurable; we can use the other jira, SPARK-16441, to fix the synchronization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
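The change discussed in this thread amounts to bounding the listener event queue and counting dropped events instead of letting a slow listener block the scheduler. A minimal sketch of that idea (this is not the actual LiveListenerBus code; the class below and any config wiring around it are hypothetical):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of a bounded listener event queue whose capacity would come from a
 * configuration value rather than a hard-coded constant. Spark's real
 * implementation lives in LiveListenerBus; this toy version only shows the
 * drop-when-full behavior that makes a bounded queue safe to post to.
 */
class BoundedEventQueue<E> {
    private final ArrayBlockingQueue<E> queue;
    private final AtomicLong dropped = new AtomicLong();

    BoundedEventQueue(int capacity) {
        // capacity would be read from configuration at startup
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking post: drop (and count) the event when the queue is full. */
    boolean post(E event) {
        if (queue.offer(event)) {
            return true;
        }
        dropped.incrementAndGet();
        return false;
    }

    /** Consumer side: block until an event is available. */
    E take() throws InterruptedException {
        return queue.take();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

With a larger configured capacity, bursts of task-end events survive long enough for the UI listener to drain them, which is exactly the symptom described in this issue.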
[jira] [Updated] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-15703: -- Assignee: Dhruve Ashar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-15703: -- Summary: Make ListenerBus event queue size configurable (was: Spark UI doesn't show all tasks as completed when it should) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16739) GBTClassifier should be a Classifier, not Predictor
Vladimir Feinberg created SPARK-16739: - Summary: GBTClassifier should be a Classifier, not Predictor Key: SPARK-16739 URL: https://issues.apache.org/jira/browse/SPARK-16739 Project: Spark Issue Type: Improvement Reporter: Vladimir Feinberg Priority: Minor Should probably wait for SPARK-4240 to be resolved first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394214#comment-15394214 ] Marcelo Vanzin commented on SPARK-16725: Spark depends on Guava 14 and shades it. The Guava you see on the classpath is because of Hadoop, and we can't change Hadoop. If the Cassandra connector needs Guava 16, it should shade it like Spark does. > Migrate Guava to 16+? > - > > Key: SPARK-16725 > URL: https://issues.apache.org/jira/browse/SPARK-16725 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Min Wei >Priority: Minor > Original Estimate: 12h > Remaining Estimate: 12h > > Currently Spark depends on an old version of Guava, version 14. However > Spark-cassandra driver asserts on Guava version 16 and above. > It would be great to update the Guava dependency to version 16+ > diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala > b/core/src/main/scala/org/apache/spark/SecurityManager.scala > index f72c7de..abddafe 100644 > --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala > +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala > @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom} > import java.security.cert.X509Certificate > import javax.net.ssl._ > > -import com.google.common.hash.HashCodes > +import com.google.common.hash.HashCode > import com.google.common.io.Files > import org.apache.hadoop.io.Text > > @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf) > val secret = new Array[Byte](length) > rnd.nextBytes(secret) > > -val cookie = HashCodes.fromBytes(secret).toString() > +val cookie = HashCode.fromBytes(secret).toString() > SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, > cookie) > cookie >} else { > diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala > b/core/src/main/scala/org/apache/spark/SparkEnv.scala > index af50a6d..02545ae 100644 > --- 
a/core/src/main/scala/org/apache/spark/SparkEnv.scala > +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala > @@ -72,7 +72,7 @@ class SparkEnv ( > >// A general, soft-reference map for metadata needed during HadoopRDD > split computation >// (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats). > - private[spark] val hadoopJobMetadata = new > MapMaker().softValues().makeMap[String, Any]() > + private[spark] val hadoopJobMetadata = new > MapMaker().weakValues().makeMap[String, Any]() > >private[spark] var driverTmpDir: Option[String] = None > > diff --git a/pom.xml b/pom.xml > index d064cb5..7c3e036 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -368,8 +368,7 @@ > > com.google.guava > guava > -14.0.1 > -provided > +19.0 > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
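Vanzin's suggestion, that the Cassandra connector shade its own Guava the way Spark does, would look roughly like the following in the connector's pom.xml. This is a sketch using maven-shade-plugin's relocation feature; the `shadedPattern` prefix is a made-up example, not anything the connector actually uses:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- Rewrite the connector's Guava references to a private
               namespace so they never collide with Hadoop's Guava 11
               or Spark's shaded Guava 14 on the classpath. -->
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>connector.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

After shading, the connector carries its own Guava 16 classes under the relocated package and the version on the application classpath becomes irrelevant to it.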
[jira] [Assigned] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16736: Assignee: (was: Apache Spark) > remove redundant FileSystem status checks calls from Spark codebase > --- > > Key: SPARK-16736 > URL: https://issues.apache.org/jira/browse/SPARK-16736 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Steve Loughran >Priority: Minor > > The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are > wrappers around {{FileSystem.getFileStatus()}}; the latter puts load on an HDFS > NN, and is very, very slow against object stores. > # If these calls are followed by any getFileStatus() calls, they can be > eliminated by careful merging and by pulling the catching of > {{FileNotFoundException}} out of the exists() call into the Spark code. > # Any sequence of exists + delete can be optimised by removing the exists > check, relying on {{FileSystem.delete()}} to be a no-op if the destination > path is not present. That's a tested requirement of all Hadoop-compatible > filesystems and object stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
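The two optimisations in the description can be sketched against a toy stand-in for Hadoop's FileSystem. The method names (`getFileStatus`, `exists`, `delete`) mirror the Hadoop API, but the in-memory store and the round-trip counter are illustrative only:

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

/** Toy model of why exists()+getFileStatus() doubles the round trips. */
class FsSketch {
    static class FileStatus {
        final long len;
        FileStatus(long len) { this.len = len; }
    }

    static class ToyFs {
        final Map<String, FileStatus> files = new HashMap<>();
        int rpcCalls = 0; // each lookup = one round trip to the NN / object store

        FileStatus getFileStatus(String path) throws FileNotFoundException {
            rpcCalls++;
            FileStatus s = files.get(path);
            if (s == null) throw new FileNotFoundException(path);
            return s;
        }

        // exists() is just getFileStatus() with the exception swallowed
        boolean exists(String path) {
            try { getFileStatus(path); return true; }
            catch (FileNotFoundException e) { return false; }
        }

        boolean delete(String path) { rpcCalls++; return files.remove(path) != null; }
    }

    /** Before: exists() then getFileStatus() = two round trips. */
    static long lenBefore(ToyFs fs, String p) {
        if (!fs.exists(p)) return -1;                     // round trip #1
        try { return fs.getFileStatus(p).len; }           // round trip #2
        catch (FileNotFoundException e) { return -1; }    // file raced away
    }

    /** After: one getFileStatus() with FileNotFoundException handled inline. */
    static long lenAfter(ToyFs fs, String p) {
        try { return fs.getFileStatus(p).len; }
        catch (FileNotFoundException e) { return -1; }
    }

    /** After: exists()+delete() collapses to a bare delete(), which the
     *  FileSystem contract requires to be a no-op on an absent path. */
    static void deleteIfPresent(ToyFs fs, String p) { fs.delete(p); }
}
```

Against HDFS this halves NameNode load for these call sites; against object stores, where each status probe is an HTTP round trip, the saving is larger still.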
[jira] [Commented] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394196#comment-15394196 ] Apache Spark commented on SPARK-16736: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/14371 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16736) remove redundant FileSystem status checks calls from Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16736: Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6948) VectorAssembler should choose dense/sparse for output based on number of zeros
[ https://issues.apache.org/jira/browse/SPARK-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394190#comment-15394190 ] Sean Owen commented on SPARK-6948: -- I tend to agree, ran into this just recently in the exact same context. It was surprising when a tiny vector was made sparse and then not usable with StandardScaler (with "subtract mean" enabled). > VectorAssembler should choose dense/sparse for output based on number of zeros > -- > > Key: SPARK-6948 > URL: https://issues.apache.org/jira/browse/SPARK-6948 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > Fix For: 1.4.0 > > > Now VectorAssembler only outputs sparse vectors. We should choose > dense/sparse format automatically, whichever uses less memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6948) VectorAssembler should choose dense/sparse for output based on number of zeros
[ https://issues.apache.org/jira/browse/SPARK-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394120#comment-15394120 ] Max Moroz commented on SPARK-6948: -- Should this be a parameter for the `transform` method of the VectorAssembler instance (with the current behavior used as default)? For example, I saw people struggle with the `SparseVector` output from VectorAssembler when they wanted to later apply `StandardScaler` with de-meaning. They will end up with an unnecessary roundtrip conversion between `DenseVector` and `SparseVector`. I *think* a similar extra conversion would occur if they later extract some of the features from the `VectorAssembler` output and combine them with other features that happen to be dense. In that case, if the user knows in advance that `DenseVector` would be required, they might choose to go with it from the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
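The auto-selection the ticket asks for boils down to a byte-count comparison between the two encodings. A hedged sketch of that rule (assuming 8-byte double values and 4-byte integer indices; this is the general break-even, not necessarily MLlib's exact compression constant):

```java
/**
 * Choose the smaller vector encoding for a given value array.
 * Dense stores one double per slot; sparse stores an index plus a value
 * per nonzero entry, so sparse only wins when the vector is mostly zeros.
 */
class VectorFormatChooser {
    static String choose(double[] values) {
        int n = values.length;
        int nnz = 0;
        for (double v : values) {
            if (v != 0.0) nnz++;
        }
        long denseBytes = 8L * n;       // 8-byte double per slot
        long sparseBytes = 12L * nnz;   // 4-byte index + 8-byte value per nonzero
        return sparseBytes < denseBytes ? "sparse" : "dense";
    }
}
```

Under this rule a small, mostly nonzero assembly comes out dense, which is exactly the case in the comments above where a sparse output forced an extra conversion before StandardScaler.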
[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]
[ https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394121#comment-15394121 ] Daniel Barclay commented on SPARK-16652: Not that I can really tell either, but this does not seem to be generated code that's invalid in the sense of not following the class file format or failing bytecode validation. It just seems to be invalid in the sense of accessing the wrong memory (e.g., something accessed the wrong address, or allocation/deallocation got out of sync): 1. Recall that I initially saw "fault occurred in a recent unsafe memory access" exceptions (before I trimmed the test case down to its current form). 2. Also, note how changing the string literal on line 63(?) in my test case changes things: - With the current multi-character value, the program crashes. - With a two-character value, it also crashes. - With a one-character value, it runs. - With a zero-character value (empty string), it runs. - With a null value, it runs. (When I was getting the "fault ... in ... unsafe" messages, a one-character value would crash, while an empty string or null would run.) > JVM crash from unsafe memory access for Dataset of class with List[Long] > > > Key: SPARK-16652 > URL: https://issues.apache.org/jira/browse/SPARK-16652 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2 > Environment: Scala 2.10. > JDK: "Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)" > MacOs 10.11.2 >Reporter: Daniel Barclay > Attachments: UnsafeAccessCrashBugTest.scala > > > Generating and writing out a {{Dataset}} of a class that has a {{List}} (at > least if it's {{List\[Long]}}) member and a {{String}} member causes a JVM > crash. 
> The crash seems to be related to unsafe memory access especially because > earlier code (before I got it reduced to the current bug test case) reported > "{{java.lang.InternalError: a fault occurred in a recent unsafe memory access > operation in compiled Java code}}". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16676) Spark jobs stay in pending
[ https://issues.apache.org/jira/browse/SPARK-16676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394094#comment-15394094 ] Joe Chong commented on SPARK-16676: --- Unfortunately, that's all the logs available in the console, as pasted above. Any idea on where else to look? > Spark jobs stay in pending > -- > > Key: SPARK-16676 > URL: https://issues.apache.org/jira/browse/SPARK-16676 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Shell >Affects Versions: 1.5.2 > Environment: Mac OS X Yosemite, Terminal, Spark-shell standalone >Reporter: Joe Chong > Attachments: Spark UI stays @ pending.png > > > I've been having issues executing certain Scala statements within the > Spark-Shell. These statements are obtained through tutorial/blog written by > Carol McDonald in MapR. > The import statements, reading text files into DataFrames are OK. However, > when I try to do df.show(), the execution hits a road block. Checking the > Spark UI job, I see that the Stage's active, however, 1 of its dependent job > stays in Pending without any movement. The logs are as below. 
> scala> fltCountsql.show() > 16/07/22 11:40:16 INFO spark.SparkContext: Starting job: show at :46 > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Registering RDD 31 (show at > :46) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Got job 4 (show at > :46) with 200 output partitions > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Final stage: ResultStage > 8(show at :46) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Parents of final stage: > List(ShuffleMapStage 7) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Missing parents: > List(ShuffleMapStage 7) > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 7 > (MapPartitionsRDD[31] at show at :46), which has no missing parents > 16/07/22 11:40:16 INFO storage.MemoryStore: ensureFreeSpace(18128) called > with curMem=115755879, maxMem=2778495713 > 16/07/22 11:40:16 INFO storage.MemoryStore: Block broadcast_5 stored as > values in memory (estimated size 17.7 KB, free 2.5 GB) > 16/07/22 11:40:16 INFO storage.MemoryStore: ensureFreeSpace(7527) called with > curMem=115774007, maxMem=2778495713 > 16/07/22 11:40:16 INFO storage.MemoryStore: Block broadcast_5_piece0 stored > as bytes in memory (estimated size 7.4 KB, free 2.5 GB) > 16/07/22 11:40:16 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in > memory on localhost:61408 (size: 7.4 KB, free: 2.5 GB) > 16/07/22 11:40:16 INFO spark.SparkContext: Created broadcast 5 from broadcast > at DAGScheduler.scala:861 > 16/07/22 11:40:16 INFO scheduler.DAGScheduler: Submitting 2 missing tasks > from ShuffleMapStage 7 (MapPartitionsRDD[31] at show at :46) > 16/07/22 11:40:16 INFO scheduler.TaskSchedulerImpl: Adding task set 7.0 with > 2 tasks > 16/07/22 11:40:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage > 7.0 (TID 4, localhost, PROCESS_LOCAL, 2156 bytes) > 16/07/22 11:40:16 INFO executor.Executor: Running task 0.0 in stage 7.0 (TID > 4) > 16/07/22 11:40:16 INFO storage.BlockManager: Found block rdd_2_0 locally > 16/07/22 11:40:17 
INFO executor.Executor: Finished task 0.0 in stage 7.0 (TID > 4). 2738 bytes result sent to driver > 16/07/22 11:40:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage > 7.0 (TID 4) in 920 ms on localhost (1/2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16725) Migrate Guava to 16+
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394084#comment-15394084 ] Min Wei commented on SPARK-16725: - Thanks for editing the JIRA fields properly. I used the first/default values in haste. The diff I attached is meant for illustration purposes only. My intent is to have a discussion of the options. Shading is another option. Personally I would prefer the upgrade/non-shading option; I assume the Guava issue won't be the only troublesome jar down the road. Yes, it might open the can of worms on various versioning management issues. Given the status of Spark as a general platform, it would be a good problem to solve :-) My two cents. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16725) Migrate Guava to 16+?
[ https://issues.apache.org/jira/browse/SPARK-16725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Wei updated SPARK-16725: Summary: Migrate Guava to 16+? (was: Migrate Guava to 16+) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16652) JVM crash from unsafe memory access for Dataset of class with List[Long]
[ https://issues.apache.org/jira/browse/SPARK-16652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394082#comment-15394082 ] Daniel Barclay commented on SPARK-16652: > This could be reproduced at my side with Spark 1.6.1 and 1.6.2, but not with > the master branch. Yes, I see it working on the master branch too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16738) Queryable state for Spark State Store
Mark Sumner created SPARK-16738: --- Summary: Queryable state for Spark State Store Key: SPARK-16738 URL: https://issues.apache.org/jira/browse/SPARK-16738 Project: Spark Issue Type: New Feature Components: Spark Core, SQL, Streaming Affects Versions: 2.0.0 Reporter: Mark Sumner Spark 2.0 will introduce the new State Store to allow state management outside the RDD model (see: SPARK-13809). This proposal seeks to include a mechanism (in a future release) to expose this internal store to external applications for querying. This would then make it possible to interact with aggregated state without needing to synchronously write to (and read from) an external store. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
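The proposal's core idea, letting external applications read the job's internal state through a read-only query handle instead of round-tripping through an external database, can be sketched as follows. All names here are hypothetical illustrations, not Spark APIs:

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of a queryable state store: the streaming job is the only writer,
 * while external readers get a narrow, read-only view of the state.
 */
class QueryableStateStore<K, V> {
    private final ConcurrentHashMap<K, V> state = new ConcurrentHashMap<>();

    /** Called by the streaming job as each batch commits its aggregates. */
    void put(K key, V value) {
        state.put(key, value);
    }

    /** Read-only handle an external application could be given. */
    interface QueryView<K, V> {
        Optional<V> get(K key);
    }

    QueryView<K, V> queryView() {
        // Expose lookups only; mutation stays private to the job.
        return key -> Optional.ofNullable(state.get(key));
    }
}
```

A real design would also need snapshot/versioning semantics so readers never observe a half-committed batch; the sketch sidesteps that by relying on per-key atomicity.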
[jira] [Resolved] (SPARK-15271) Allow force pulling executor docker images
[ https://issues.apache.org/jira/browse/SPARK-15271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15271. --- Resolution: Fixed Issue resolved by pull request 14348 [https://github.com/apache/spark/pull/14348] > Allow force pulling executor docker images > -- > > Key: SPARK-15271 > URL: https://issues.apache.org/jira/browse/SPARK-15271 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.6.1 >Reporter: Philipp Hoffmann >Assignee: Philipp Hoffmann > Fix For: 2.1.0 > > > Mesos agents by default will not pull docker images which are cached locally > already. > Because of this, in order to run a mutable tag (like {{...:latest}}) from the > current version on the docker repository you have to explicitly tell the > Mesos agent to pull the image (force pull). Otherwise the Mesos agent will > run an old (cached version). > The feature for force pulling the image was introduced in Mesos 0.22: > https://github.com/apache/mesos/commit/8682569df528717ff5efb64da26b1b49c39c4efd > This ticket is about making use of this feature in Spark in order to force > Mesos agents to pull the executors docker image. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16713) Limit codegen method size to 8KB
[ https://issues.apache.org/jira/browse/SPARK-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16713: Assignee: (was: Apache Spark) > Limit codegen method size to 8KB > > > Key: SPARK-16713 > URL: https://issues.apache.org/jira/browse/SPARK-16713 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Qifan Pu > > Ideally, codegen methods should stay under 8KB of bytecode. > Beyond 8KB the JIT won't compile them, which can cause performance degradation. We have > seen this for queries with a wide schema (30+ fields), where > agg_doAggregateWithKeys() can be more than 8KB. This is also a major reason > for performance regression when we enable the fast aggregate hashmap (such as > when using VectorizedHashMapGenerator.scala). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
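One common way to honor such a limit is to split a long run of generated statements across helper methods, each kept under a bytecode budget (HotSpot refuses to JIT-compile methods above roughly 8000 bytes of bytecode, per `-XX:HugeMethodLimit`). A sketch of the splitting logic only; the per-statement cost function is a stand-in for measuring real generated bytecode:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/**
 * Greedily partition generated statements into helper-method bodies so that
 * no single method's estimated bytecode size exceeds the budget.
 */
class CodegenSplitter {
    static List<List<String>> split(List<String> stmts, int budget,
                                    ToIntFunction<String> cost) {
        List<List<String>> methods = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int used = 0;
        for (String s : stmts) {
            int c = cost.applyAsInt(s);
            // Start a new helper method once the budget would be exceeded,
            // but never emit an empty method.
            if (!current.isEmpty() && used + c > budget) {
                methods.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(s);
            used += c;
        }
        if (!current.isEmpty()) {
            methods.add(current);
        }
        return methods;
    }
}
```

The generated top-level method then just calls each helper in order, so every compiled unit stays below the JIT threshold at the cost of a few extra invocations.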
[jira] [Commented] (SPARK-16713) Limit codegen method size to 8KB
[ https://issues.apache.org/jira/browse/SPARK-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393863#comment-15393863 ] Apache Spark commented on SPARK-16713: -- User 'lw-lin' has created a pull request for this issue: https://github.com/apache/spark/pull/14370 > Limit codegen method size to 8KB > > > Key: SPARK-16713 > URL: https://issues.apache.org/jira/browse/SPARK-16713 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Qifan Pu > > Ideally, generated codegen methods should stay under 8KB of bytecode. > Beyond 8KB the JIT will not compile the method, which can cause performance > degradation. We have seen this for queries with wide schemas (30+ fields), where > agg_doAggregateWithKeys() can be more than 8KB. This is also a major reason > for performance regression when we enable the fast aggregate hashmap (such as > using VectorizedHashMapGenerator.scala).
[jira] [Assigned] (SPARK-16713) Limit codegen method size to 8KB
[ https://issues.apache.org/jira/browse/SPARK-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16713: Assignee: Apache Spark > Limit codegen method size to 8KB > > > Key: SPARK-16713 > URL: https://issues.apache.org/jira/browse/SPARK-16713 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Qifan Pu >Assignee: Apache Spark > > Ideally, generated codegen methods should stay under 8KB of bytecode. > Beyond 8KB the JIT will not compile the method, which can cause performance > degradation. We have seen this for queries with wide schemas (30+ fields), where > agg_doAggregateWithKeys() can be more than 8KB. This is also a major reason > for performance regression when we enable the fast aggregate hashmap (such as > using VectorizedHashMapGenerator.scala).
[jira] [Commented] (SPARK-16736) remove redundant FileSystem status check calls from Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393770#comment-15393770 ] Steve Loughran commented on SPARK-16736: See also HIVE-14323 > remove redundant FileSystem status check calls from Spark codebase > --- > > Key: SPARK-16736 > URL: https://issues.apache.org/jira/browse/SPARK-16736 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Steve Loughran >Priority: Minor > > The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are > wrappers around {{FileSystem.getFileStatus()}}; the latter puts load on an HDFS > NameNode and is very, very slow against object stores. > # If these calls are followed by any {{getFileStatus()}} calls then they can be > eliminated by careful merging, pulling the catching of > {{FileNotFoundException}} out of the {{exists()}} call into the Spark code. > # Any sequence of exists + delete can be optimised by removing the exists > check, relying on {{FileSystem.delete()}} to be a no-op if the destination > path is not present. That's a tested requirement of all Hadoop-compatible > filesystems and object stores.
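Both optimisations can be illustrated outside Hadoop with plain local file operations: `os.stat` plays the role of `getFileStatus` (one metadata round trip that raises on a missing path), and the function names below are hypothetical, not Spark or Hadoop APIs:

```python
import os

# Pattern 1: instead of exists() followed by a status call (two metadata
# round trips), issue the status call once and catch the not-found error.
def size_if_present(path):
    try:
        return os.stat(path).st_size   # single metadata call
    except FileNotFoundError:
        return None                    # replaces the separate exists() probe

# Pattern 2: instead of exists() followed by delete(), just delete and
# treat "nothing to delete" as a no-op, mirroring the documented behaviour
# of Hadoop's FileSystem.delete() on a missing path.
def delete_if_present(path):
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False
```

Against an object store, where each metadata call is an HTTP request, halving the number of calls per file is a direct latency win.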
[jira] [Updated] (SPARK-16736) remove redundant FileSystem status check calls from Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-16736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-16736: --- Summary: remove redundant FileSystem status check calls from Spark codebase (was: remove redundant FileSystem.exists() calls from Spark codebase) > remove redundant FileSystem status check calls from Spark codebase > --- > > Key: SPARK-16736 > URL: https://issues.apache.org/jira/browse/SPARK-16736 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Steve Loughran >Priority: Minor > > The Hadoop {{FileSystem.exists()}} and {{FileSystem.isDirectory()}} calls are > wrappers around {{FileSystem.getFileStatus()}}; the latter puts load on an HDFS > NameNode and is very, very slow against object stores. > # If these calls are followed by any {{getFileStatus()}} calls then they can be > eliminated by careful merging, pulling the catching of > {{FileNotFoundException}} out of the {{exists()}} call into the Spark code. > # Any sequence of exists + delete can be optimised by removing the exists > check, relying on {{FileSystem.delete()}} to be a no-op if the destination > path is not present. That's a tested requirement of all Hadoop-compatible > filesystems and object stores.
[jira] [Created] (SPARK-16737) ListingFileCatalog comment about RPC calls in object store isn't correct
Steve Loughran created SPARK-16737: -- Summary: ListingFileCatalog comment about RPC calls in object store isn't correct Key: SPARK-16737 URL: https://issues.apache.org/jira/browse/SPARK-16737 Project: Spark Issue Type: Bug Components: SQL Reporter: Steve Loughran Priority: Trivial The comment text which came in with SPARK-1612 says {code} - Although S3/S3A/S3N file system can be quite slow for remote file metadata // operations, calling `getFileBlockLocations` does no harm here since these file system // implementations don't actually issue RPC for this method. {code} That doesn't hold for OpenStack Swift, which can exhibit rack locality if the Swift nodes are spread across the same racks as the application. It is safest to remove that comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16737) ListingFileCatalog comment about RPC calls in object store isn't correct
[ https://issues.apache.org/jira/browse/SPARK-16737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-16737: --- Description: The comment text which came in with SPARK-16121 says {code} - Although S3/S3A/S3N file system can be quite slow for remote file metadata // operations, calling `getFileBlockLocations` does no harm here since these file system // implementations don't actually issue RPC for this method. {code} That doesn't hold for OpenStack Swift, which can exhibit rack locality if the Swift nodes are spread across the same racks as the application. It is safest to remove that comment. was: The comment text which came in with SPARK-1612 says {code} - Although S3/S3A/S3N file system can be quite slow for remote file metadata // operations, calling `getFileBlockLocations` does no harm here since these file system // implementations don't actually issue RPC for this method. {code} That doesn't hold for OpenStack Swift, which can exhibit rack locality if the Swift nodes are spread across the same racks as the application. It is safest to remove that comment. > ListingFileCatalog comment about RPC calls in object store isn't correct > - > > Key: SPARK-16737 > URL: https://issues.apache.org/jira/browse/SPARK-16737 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Steve Loughran >Priority: Trivial > > The comment text which came in with SPARK-16121 says > {code} > - Although S3/S3A/S3N file system can be quite slow for remote file metadata > // operations, calling `getFileBlockLocations` does no harm here > since these file system > // implementations don't actually issue RPC for this method. > {code} > That doesn't hold for OpenStack Swift, which can exhibit rack locality if > the Swift nodes are spread across the same racks as the application. It is > safest to remove that comment.
[jira] [Updated] (SPARK-16717) Dataframe (jdbc) is missing a way to link an external function to get a connection
[ https://issues.apache.org/jira/browse/SPARK-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Colombo updated SPARK-16717: -- Description: In JdbcRDD it was possible to use a function to get a JDBC connection. This allowed external handling of connections, which is no longer possible with DataFrames. Please consider an addition to DataFrames for using an externally provided connectionFactory (such as a connection pool) in order to make data loading more efficient, avoiding connection close/recreation. Connections should be taken from the provided function and returned to a second function when no longer used by the RDD. This will make JDBC handling more efficient. I.e. extending the DataFrame class with a method like jdbc(Function0 getConnection, Function0 releaseConnection(java.sql.Connection)) was: In JdbcRDD it was possible to use a function to get a JDBC connection. This allowed external handling of connections, which is no longer possible with DataFrames. Please consider an addition to DataFrames for using an externally provided connectionFactory (such as a connection pool) in order to make data loading more efficient, avoiding connection close/recreation. > Dataframe (jdbc) is missing a way to link an external function to get a > connection > --- > > Key: SPARK-16717 > URL: https://issues.apache.org/jira/browse/SPARK-16717 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.2 >Reporter: Marco Colombo > > In JdbcRDD it was possible to use a function to get a JDBC connection. This > allowed external handling of connections, which is no longer possible > with DataFrames. > Please consider an addition to DataFrames for using an externally provided > connectionFactory (such as a connection pool) in order to make data loading > more efficient, avoiding connection close/recreation. Connections should be > taken from the provided function and returned to a second function when no > longer used by the RDD. This will make JDBC handling more efficient. > I.e. extending the DataFrame class with a method like > jdbc(Function0 getConnection, Function0 > releaseConnection(java.sql.Connection))
[jira] [Commented] (SPARK-16717) Dataframe (jdbc) is missing a way to link an external function to get a connection
[ https://issues.apache.org/jira/browse/SPARK-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393696#comment-15393696 ] Marco Colombo commented on SPARK-16717: --- Typo, I mean 1.6.2 > Dataframe (jdbc) is missing a way to link an external function to get a > connection > --- > > Key: SPARK-16717 > URL: https://issues.apache.org/jira/browse/SPARK-16717 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.2 >Reporter: Marco Colombo > > In JdbcRDD it was possible to use a function to get a JDBC connection. This > allowed external handling of connections, which is no longer possible > with DataFrames. > Please consider an addition to DataFrames for using an externally provided > connectionFactory (such as a connection pool) in order to make data loading > more efficient, avoiding connection close/recreation.
[jira] [Updated] (SPARK-16717) Dataframe (jdbc) is missing a way to link an external function to get a connection
[ https://issues.apache.org/jira/browse/SPARK-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Colombo updated SPARK-16717: -- Affects Version/s: (was: 1.3.0) 1.6.2 > Dataframe (jdbc) is missing a way to link an external function to get a > connection > --- > > Key: SPARK-16717 > URL: https://issues.apache.org/jira/browse/SPARK-16717 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.2 >Reporter: Marco Colombo > > In JdbcRDD it was possible to use a function to get a JDBC connection. This > allowed external handling of connections, which is no longer possible > with DataFrames. > Please consider an addition to DataFrames for using an externally provided > connectionFactory (such as a connection pool) in order to make data loading > more efficient, avoiding connection close/recreation.
[jira] [Comment Edited] (SPARK-16735) Fail to create a map containing decimal type with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393639#comment-15393639 ] Liang Ke edited comment on SPARK-16735 at 7/26/16 12:00 PM: hi, if some one can help me to push my patch to github ? thx a lot : ) was (Author: biglobster): hi, if some one can help me to push my patch to github? thx alot:) > Fail to create a map containing decimal type with literals having different > inferred precisions and scales > - > > Key: SPARK-16735 > URL: https://issues.apache.org/jira/browse/SPARK-16735 > Project: Spark > Issue Type: Sub-task >Affects Versions: 2.0.0, 2.0.1 >Reporter: Liang Ke > Attachments: SPARK-16735.patch > > > In Spark 2.0, we parse float literals as decimals. However, this introduces > a side-effect, which is described below. > spark-sql> select map(0.1,0.01, 0.2,0.033); > Error in query: cannot resolve 'map(CAST(0.1 AS DECIMAL(1,1)), CAST(0.01 AS > DECIMAL(2,2)), CAST(0.2 AS DECIMAL(1,1)), CAST(0.033 AS DECIMAL(3,3)))' due > to data type mismatch: The given values of function map should all be the > same type, but they are [decimal(2,2), decimal(3,3)]; line 1 pos 7
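One way to resolve such mismatches is to widen the operands to a single common decimal type instead of demanding exact equality. The sketch below implements the standard widening rule, p = max(p1 - s1, p2 - s2) + max(s1, s2) and s = max(s1, s2); it illustrates the rule only and is not a quote of Spark's implementation:

```python
def widen_decimal(p1, s1, p2, s2):
    """Smallest decimal(p, s) that can hold both decimal(p1, s1) and
    decimal(p2, s2): keep the larger integer-digit count and the
    larger scale."""
    integer_digits = max(p1 - s1, p2 - s2)  # digits before the point
    scale = max(s1, s2)                      # digits after the point
    return integer_digits + scale, scale

# The failing example mixes value types decimal(2,2) (from 0.01) and
# decimal(3,3) (from 0.033); widening them yields decimal(3,3), which
# can represent both values exactly.
```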