[jira] [Resolved] (SPARK-15313) EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject.
[ https://issues.apache.org/jira/browse/SPARK-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-15313.
---------------------------------
    Resolution: Fixed
      Assignee: Takuya Ueshin
 Fix Version/s: 2.0.0

> EmbedSerializerInFilter rule should keep exprIds of output of surrounded
> SerializeFromObject.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-15313
>                 URL: https://issues.apache.org/jira/browse/SPARK-15313
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Takuya Ueshin
>            Assignee: Takuya Ueshin
>             Fix For: 2.0.0
>
> The following code:
> {code}
> val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
> ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
> {code}
> throws an Exception:
> {noformat}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:254)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
> at org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
> at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
> at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
> at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
> at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
> at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
> at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
> at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
> at org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:113)
> at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
> at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
> at org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:79)
> at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
> at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
> at
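The binding failure above comes down to attribute identity: Catalyst attributes are matched by exprId, not by name, so an optimizer rewrite that re-creates the serializer's output with fresh ids orphans every downstream reference (here, the Project's `_1#420`). A minimal toy model of that pitfall in Python (illustrative only, not Catalyst code; the names and structures are assumptions):

```python
import itertools

_ids = itertools.count(420)

def fresh_attr(name):
    # Catalyst-style attribute: equality is decided by the numeric id, not the name.
    return (name, next(_ids))

def bind(ref, child_output):
    # Rough analogue of BindReferences: locate the referenced attribute in
    # the child's output *by id* and return its ordinal.
    for i, attr in enumerate(child_output):
        if attr[1] == ref[1]:
            return i
    raise LookupError("Binding attribute, tree: %s#%d" % ref)

_1 = fresh_attr("_1")            # SerializeFromObject originally outputs _1#420
original_output = [_1]
project_ref = _1                 # the Project above references _1#420

# Buggy rewrite: rebuilds the serializer output with a fresh exprId (_1#421),
# so the Project's reference no longer binds.
buggy_output = [fresh_attr("_1")]
try:
    bind(project_ref, buggy_output)
except LookupError as e:
    print(e)                     # Binding attribute, tree: _1#420

# Fixed rewrite: keep the exprIds of the surrounded SerializeFromObject.
fixed_output = original_output
print(bind(project_ref, fixed_output))  # 0
```

This is why the fix is phrased as "keep exprIds of output of surrounded SerializeFromObject": reusing the old ids lets already-resolved references above the rewritten node keep binding.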
[jira] [Comment Edited] (SPARK-15393) Writing empty Dataframes doesn't save any _metadata files
[ https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292750#comment-15292750 ]

Hyukjin Kwon edited comment on SPARK-15393 at 5/20/16 5:51 AM:
--------------------------------------------------------------

[~jurriaanpruis] Hm.. I am trying to reproduce these exceptions. I added a test to {{ParquetHadoopFsRelationSuite}} as below and ran it **after/before my PR**:

{code}
test("SPARK-15393: create empty file") {
  withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
    withTempPath { path =>
      val schema = StructType(
        StructField("k", StringType, true) ::
        StructField("v", IntegerType, false) :: Nil)
      val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      emptyDf.write
        .format("parquet")
        .save(path.getCanonicalPath)
      val copyEmptyDf = spark.read
        .format("parquet")
        .load(path.getCanonicalPath)
      copyEmptyDf.show()
    }
  }
}
{code}

I could reproduce the exceptions on read, but could not yet reproduce them on write, either before or after the PR (I ran it more than 10 times in each case). It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be the cause of these exceptions. Do you mind sharing the code you ran?

was (Author: hyukjin.kwon):
[~jurriaanpruis] Hm.. I am trying to reproduce these exceptions. I added a test to {{ParquetHadoopFsRelationSuite}} as below and ran it after/before my PR:

{code}
test("SPARK-15393: create empty file") {
  withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
    withTempPath { path =>
      val schema = StructType(
        StructField("k", StringType, true) ::
        StructField("v", IntegerType, false) :: Nil)
      val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      emptyDf.write
        .format("parquet")
        .save(path.getCanonicalPath)
      val copyEmptyDf = spark.read
        .format("parquet")
        .load(path.getCanonicalPath)
      copyEmptyDf.show()
    }
  }
}
{code}

I could reproduce the exceptions on read, but could not yet reproduce them on write, either before or after the PR (I ran it more than 10 times in each case). It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be the cause of these exceptions. Do you mind sharing the code you ran?

> Writing empty Dataframes doesn't save any _metadata files
> ----------------------------------------------------------
>
>                 Key: SPARK-15393
>                 URL: https://issues.apache.org/jira/browse/SPARK-15393
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jurriaan Pruis
>            Priority: Critical
>
> Writing empty dataframes is broken on latest master.
> It omits the metadata and sometimes throws the following exception (when
> saving as parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary file for file:/some/test/file
> java.lang.NullPointerException
> at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at
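The NullPointerException in {{ParquetFileWriter.mergeFooters}} is consistent with summary-file merging that assumes at least one data-file footer exists: an empty DataFrame write produces no part-files, hence no footers to merge. A toy Python sketch of that failure mode (hypothetical structure, not the parquet-mr code):

```python
def merge_footers(footers):
    # Toy analogue of parquet summary-file ("_metadata") merging. The merged
    # schema starts out unset and is only initialized from the first footer,
    # so a write that produced zero data files leaves it unset.
    merged_schema = None
    for footer in footers:
        if merged_schema is None:
            merged_schema = footer["schema"]
        elif merged_schema != footer["schema"]:
            raise ValueError("could not merge incompatible schemas")
    if merged_schema is None:
        # A plausible analogue of where the real code dereferences its
        # accumulator and hits a NullPointerException on the empty case.
        raise TypeError("no footers to merge (empty write)")
    return {"schema": merged_schema, "num_files": len(footers)}

print(merge_footers([{"schema": "k: string, v: int"}]))
```

Under this reading, the "_metadata omitted" symptom and the NPE are two faces of the same thing: with zero footers there is nothing to summarize, and the merge step does not handle that case gracefully.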
[jira] [Created] (SPARK-15434) improve EmbedSerializerInFilter rule
Wenchen Fan created SPARK-15434:
-----------------------------------

             Summary: improve EmbedSerializerInFilter rule
                 Key: SPARK-15434
                 URL: https://issues.apache.org/jira/browse/SPARK-15434
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer
    Affects Versions: 2.0.0
            Reporter: Wenchen Fan
            Assignee: Wenchen Fan

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15393) Writing empty Dataframes doesn't save any _metadata files
[ https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292750#comment-15292750 ]

Hyukjin Kwon commented on SPARK-15393:
--------------------------------------

Hm.. I am trying to reproduce these exceptions. I added a test to {{ParquetHadoopFsRelationSuite}} as below and ran it after/before my PR:

{code}
test("SPARK-15393: create empty file") {
  withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
    withTempPath { path =>
      val schema = StructType(
        StructField("k", StringType, true) ::
        StructField("v", IntegerType, false) :: Nil)
      val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      emptyDf.write
        .format("parquet")
        .save(path.getCanonicalPath)
      val copyEmptyDf = spark.read
        .format("parquet")
        .load(path.getCanonicalPath)
      copyEmptyDf.show()
    }
  }
}
{code}

I could reproduce the exceptions on read, but could not yet reproduce them on write, either before or after the PR (I ran it more than 10 times in each case). It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be the cause of these exceptions. Do you mind sharing the code you ran?

> Writing empty Dataframes doesn't save any _metadata files
> ----------------------------------------------------------
>
>                 Key: SPARK-15393
>                 URL: https://issues.apache.org/jira/browse/SPARK-15393
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jurriaan Pruis
>            Priority: Critical
>
> Writing empty dataframes is broken on latest master.
> It omits the metadata and sometimes throws the following exception (when
> saving as parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary file for file:/some/test/file
> java.lang.NullPointerException
> at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
[jira] [Comment Edited] (SPARK-15393) Writing empty Dataframes doesn't save any _metadata files
[ https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292750#comment-15292750 ]

Hyukjin Kwon edited comment on SPARK-15393 at 5/20/16 5:36 AM:
--------------------------------------------------------------

[~jurriaanpruis] Hm.. I am trying to reproduce these exceptions. I added a test to {{ParquetHadoopFsRelationSuite}} as below and ran it after/before my PR:

{code}
test("SPARK-15393: create empty file") {
  withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
    withTempPath { path =>
      val schema = StructType(
        StructField("k", StringType, true) ::
        StructField("v", IntegerType, false) :: Nil)
      val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      emptyDf.write
        .format("parquet")
        .save(path.getCanonicalPath)
      val copyEmptyDf = spark.read
        .format("parquet")
        .load(path.getCanonicalPath)
      copyEmptyDf.show()
    }
  }
}
{code}

I could reproduce the exceptions on read, but could not yet reproduce them on write, either before or after the PR (I ran it more than 10 times in each case). It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be the cause of these exceptions. Do you mind sharing the code you ran?

was (Author: hyukjin.kwon):
Hm.. I am trying to reproduce these exceptions. I added a test to {{ParquetHadoopFsRelationSuite}} as below and ran it after/before my PR:

{code}
test("SPARK-15393: create empty file") {
  withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
    withTempPath { path =>
      val schema = StructType(
        StructField("k", StringType, true) ::
        StructField("v", IntegerType, false) :: Nil)
      val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
      emptyDf.write
        .format("parquet")
        .save(path.getCanonicalPath)
      val copyEmptyDf = spark.read
        .format("parquet")
        .load(path.getCanonicalPath)
      copyEmptyDf.show()
    }
  }
}
{code}

I could reproduce the exceptions on read, but could not yet reproduce them on write, either before or after the PR (I ran it more than 10 times in each case). It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be the cause of these exceptions. Do you mind sharing the code you ran?

> Writing empty Dataframes doesn't save any _metadata files
> ----------------------------------------------------------
>
>                 Key: SPARK-15393
>                 URL: https://issues.apache.org/jira/browse/SPARK-15393
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jurriaan Pruis
>            Priority: Critical
>
> Writing empty dataframes is broken on latest master.
> It omits the metadata and sometimes throws the following exception (when
> saving as parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary file for file:/some/test/file
> java.lang.NullPointerException
> at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
[jira] [Updated] (SPARK-15057) Remove stale TODO comment for making `enum` in GraphGenerators
[ https://issues.apache.org/jira/browse/SPARK-15057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15057:
--------------------------------
    Fix Version/s: 2.0.0

> Remove stale TODO comment for making `enum` in GraphGenerators
> --------------------------------------------------------------
>
>                 Key: SPARK-15057
>                 URL: https://issues.apache.org/jira/browse/SPARK-15057
>             Project: Spark
>          Issue Type: Task
>          Components: GraphX
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Trivial
>             Fix For: 2.0.0
>
> This PR removes a stale TODO comment in GraphGenerators.scala
[jira] [Updated] (SPARK-15057) Remove stale TODO comment for making `enum` in GraphGenerators
[ https://issues.apache.org/jira/browse/SPARK-15057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15057:
--------------------------------
    Fix Version/s: (was: 2.1.0)

> Remove stale TODO comment for making `enum` in GraphGenerators
> --------------------------------------------------------------
>
>                 Key: SPARK-15057
>                 URL: https://issues.apache.org/jira/browse/SPARK-15057
>             Project: Spark
>          Issue Type: Task
>          Components: GraphX
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Trivial
>             Fix For: 2.0.0
>
> This PR removes a stale TODO comment in GraphGenerators.scala
[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-14261:
--------------------------------
    Assignee: Oleg Danilov

> Memory leak in Spark Thrift Server
> -----------------------------------
>
>                 Key: SPARK-14261
>                 URL: https://issues.apache.org/jira/browse/SPARK-14261
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Xiaochun Liang
>            Assignee: Oleg Danilov
>             Fix For: 1.6.2, 2.0.0
>
>         Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG,
> 8892_4g_objects.PNG, 8892_5g_objects.PNG, 8892_6g_objects.PNG,
> 8892_6g_stop_longrunquery_objects.PNG, 8892_MemorySnapshot.PNG,
> MemorySnapshot.PNG
>
> I am running the Spark Thrift server on Windows Server 2012. The Spark Thrift
> server is launched in YARN client mode. Its memory usage increases gradually
> as queries come in, so I suspect there is a memory leak in the Spark Thrift
> server.
[jira] [Resolved] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-14261.
---------------------------------
    Resolution: Fixed
 Fix Version/s: 2.0.0
                1.6.2

> Memory leak in Spark Thrift Server
> -----------------------------------
>
>                 Key: SPARK-14261
>                 URL: https://issues.apache.org/jira/browse/SPARK-14261
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Xiaochun Liang
>             Fix For: 1.6.2, 2.0.0
>
>         Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG,
> 8892_4g_objects.PNG, 8892_5g_objects.PNG, 8892_6g_objects.PNG,
> 8892_6g_stop_longrunquery_objects.PNG, 8892_MemorySnapshot.PNG,
> MemorySnapshot.PNG
>
> I am running the Spark Thrift server on Windows Server 2012. The Spark Thrift
> server is launched in YARN client mode. Its memory usage increases gradually
> as queries come in, so I suspect there is a memory leak in the Spark Thrift
> server.
[jira] [Assigned] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15433:
------------------------------------
    Assignee: Apache Spark

> PySpark core test should not use SerDe from PythonMLLibAPI
> -----------------------------------------------------------
>
>                 Key: SPARK-15433
>                 URL: https://issues.apache.org/jira/browse/SPARK-15433
>             Project: Spark
>          Issue Type: Test
>          Components: PySpark
>            Reporter: Liang-Chi Hsieh
>            Assignee: Apache Spark
>            Priority: Minor
>
> Currently PySpark core test uses the SerDe from PythonMLLibAPI which includes
> many MLlib things. It should use SerDeUtil instead.
[jira] [Commented] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292739#comment-15292739 ]

Apache Spark commented on SPARK-15433:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/13214

> PySpark core test should not use SerDe from PythonMLLibAPI
> -----------------------------------------------------------
>
>                 Key: SPARK-15433
>                 URL: https://issues.apache.org/jira/browse/SPARK-15433
>             Project: Spark
>          Issue Type: Test
>          Components: PySpark
>            Reporter: Liang-Chi Hsieh
>            Priority: Minor
>
> Currently PySpark core test uses the SerDe from PythonMLLibAPI which includes
> many MLlib things. It should use SerDeUtil instead.
[jira] [Assigned] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15433:
------------------------------------
    Assignee: (was: Apache Spark)

> PySpark core test should not use SerDe from PythonMLLibAPI
> -----------------------------------------------------------
>
>                 Key: SPARK-15433
>                 URL: https://issues.apache.org/jira/browse/SPARK-15433
>             Project: Spark
>          Issue Type: Test
>          Components: PySpark
>            Reporter: Liang-Chi Hsieh
>            Priority: Minor
>
> Currently PySpark core test uses the SerDe from PythonMLLibAPI which includes
> many MLlib things. It should use SerDeUtil instead.
[jira] [Issue Comment Deleted] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Miao Wang updated SPARK-15433:
------------------------------
    Comment: was deleted

(was: [~viirya] If you are not working on this one, I would like to take it. Thanks!)

> PySpark core test should not use SerDe from PythonMLLibAPI
> -----------------------------------------------------------
>
>                 Key: SPARK-15433
>                 URL: https://issues.apache.org/jira/browse/SPARK-15433
>             Project: Spark
>          Issue Type: Test
>          Components: PySpark
>            Reporter: Liang-Chi Hsieh
>            Priority: Minor
>
> Currently PySpark core test uses the SerDe from PythonMLLibAPI which includes
> many MLlib things. It should use SerDeUtil instead.
[jira] [Commented] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI
[ https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292738#comment-15292738 ]

Miao Wang commented on SPARK-15433:
-----------------------------------

[~viirya] If you are not working on this one, I would like to take it. Thanks!

> PySpark core test should not use SerDe from PythonMLLibAPI
> -----------------------------------------------------------
>
>                 Key: SPARK-15433
>                 URL: https://issues.apache.org/jira/browse/SPARK-15433
>             Project: Spark
>          Issue Type: Test
>          Components: PySpark
>            Reporter: Liang-Chi Hsieh
>            Priority: Minor
>
> Currently PySpark core test uses the SerDe from PythonMLLibAPI which includes
> many MLlib things. It should use SerDeUtil instead.
[jira] [Created] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI
Liang-Chi Hsieh created SPARK-15433:
---------------------------------------

             Summary: PySpark core test should not use SerDe from PythonMLLibAPI
                 Key: SPARK-15433
                 URL: https://issues.apache.org/jira/browse/SPARK-15433
             Project: Spark
          Issue Type: Test
          Components: PySpark
            Reporter: Liang-Chi Hsieh
            Priority: Minor

Currently PySpark core test uses the SerDe from PythonMLLibAPI which includes
many MLlib things. It should use SerDeUtil instead.
[jira] [Resolved] (SPARK-14990) nvl, coalesce, array functions with parameter of type "array"
[ https://issues.apache.org/jira/browse/SPARK-14990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-14990.
---------------------------------
    Resolution: Fixed
      Assignee: Reynold Xin
 Fix Version/s: 2.0.0

> nvl, coalesce, array functions with parameter of type "array"
> --------------------------------------------------------------
>
>                 Key: SPARK-14990
>                 URL: https://issues.apache.org/jira/browse/SPARK-14990
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Oleg Danilov
>            Assignee: Reynold Xin
>            Priority: Minor
>             Fix For: 2.0.0
>
> Steps to reproduce:
> 1. create table tmp(col1 int, col2 array)
> 2. insert values:
> {code}
> 1, [1]
> 2, [2]
> 3, NULL
> {code}
> 3. run the query
> select col1, coalesce(col2, array(1,2,3)) from tmp;
> Expected result:
> {code}
> 1, [1]
> 2, [2]
> 3, [1,2,3]
> {code}
> Current result:
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'coalesce(col2,array(1,2,3))' due to data type mismatch: input to function coalesce should all be the same type, but it's [array, array]; line 1 pos 38 (state=,code=0)
> {code}
> The fix seems to be pretty easy, will create a PR soon.
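COALESCE itself is plain first-non-null semantics; the bug reported above is purely in the analyzer's type check, where the literal array's type failed to unify with the column's array type. The intended row-level behavior from the report can be sketched in Python (illustrative only, not Spark's implementation; `None` stands in for SQL NULL):

```python
def coalesce(*args):
    """SQL COALESCE: return the first non-NULL argument, else NULL."""
    for a in args:
        if a is not None:
            return a
    return None

# The table from the report (col1, col2), with None as SQL NULL.
rows = [(1, [1]), (2, [2]), (3, None)]

# select col1, coalesce(col2, array(1,2,3)) from tmp
result = [(c1, coalesce(c2, [1, 2, 3])) for c1, c2 in rows]
print(result)  # [(1, [1]), (2, [2]), (3, [1, 2, 3])]
```

This reproduces the "Expected result" table from the issue, which is why the reporter calls the fix easy: the runtime semantics are fine, only the type-coercion rule rejecting the arguments needed relaxing.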
[jira] [Assigned] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15363:
------------------------------------
    Assignee: (was: Apache Spark)

> Example code shouldn't use VectorImplicits._, asML/fromML
> ----------------------------------------------------------
>
>                 Key: SPARK-15363
>                 URL: https://issues.apache.org/jira/browse/SPARK-15363
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, ML
>            Reporter: Xiangrui Meng
>
> In SPARK-14615, we use VectorImplicits._ and asML in example code to
> minimize the changes in that PR. However, this is a private API, which
> shouldn't appear in the example code. We should consider updating them
> during QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala
[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292723#comment-15292723 ]

Apache Spark commented on SPARK-15363:
--------------------------------------

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/13213

> Example code shouldn't use VectorImplicits._, asML/fromML
> ----------------------------------------------------------
>
>                 Key: SPARK-15363
>                 URL: https://issues.apache.org/jira/browse/SPARK-15363
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, ML
>            Reporter: Xiangrui Meng
>
> In SPARK-14615, we use VectorImplicits._ and asML in example code to
> minimize the changes in that PR. However, this is a private API, which
> shouldn't appear in the example code. We should consider updating them
> during QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala
[jira] [Assigned] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15363:
------------------------------------
    Assignee: Apache Spark

> Example code shouldn't use VectorImplicits._, asML/fromML
> ----------------------------------------------------------
>
>                 Key: SPARK-15363
>                 URL: https://issues.apache.org/jira/browse/SPARK-15363
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, ML
>            Reporter: Xiangrui Meng
>            Assignee: Apache Spark
>
> In SPARK-14615, we use VectorImplicits._ and asML in example code to
> minimize the changes in that PR. However, this is a private API, which
> shouldn't appear in the example code. We should consider updating them
> during QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala
[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292719#comment-15292719 ]

Maciej Bryński commented on SPARK-15345:
----------------------------------------

Will try to test it with [~m1lan] today.

> SparkSession's conf doesn't take effect when there's already an existing
> SparkContext
> -------------------------------------------------------------------------
>
>                 Key: SPARK-15345
>                 URL: https://issues.apache.org/jira/browse/SPARK-15345
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0
>            Reporter: Piotr Milanowski
>            Assignee: Reynold Xin
>            Priority: Blocker
>             Fix For: 2.0.0
>
> I am working with branch-2.0; spark is compiled with hive support (-Phive and
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that spark doesn't find any databases specified in configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark
> 1.6, and launching the above snippet, I can print out existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out: > {code} > 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases > 16/05/16 12:17:47 DEBUG SimpleAnalyzer: > === Result of Batch Resolution === > !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, > string])) null else input[0, string].toString, > StructField(result,StringType,false)), result#2) AS #3] Project > [createexternalrow(if (isnull(result#2)) null else result#2.toString, > StructField(result,StringType,false)) AS #3] > +- LocalRelation [result#2] > > +- LocalRelation [result#2] > > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.Dataset$$anonfun$53) +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: private final > org.apache.spark.sql.types.StructType > org.apache.spark.sql.Dataset$$anonfun$53.structType$1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure > (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) > +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure >
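The pitfall behind this report is that configuration set on a session builder can be silently dropped when a SparkContext already exists. As a toy illustration only (plain Python, not Spark's actual `getOrCreate` implementation; all names here are invented for the sketch), the fix amounts to merging new options into the existing session instead of discarding them:

```python
# Toy model of the SPARK-15345 pitfall: a get-or-create helper that ignores
# builder options whenever an instance already exists loses configuration.
# Merging the new options into the existing instance avoids that.

class Session:
    def __init__(self, conf):
        self.conf = dict(conf)

_existing = None

def get_or_create(options):
    """Return the existing session, but still apply newly supplied options."""
    global _existing
    if _existing is None:
        _existing = Session(options)
    else:
        # The essential fix: do not drop options just because a session
        # (or, in Spark, a SparkContext) already exists.
        _existing.conf.update(options)
    return _existing

first = get_or_create({"spark.app.name": "a"})
second = get_or_create({"spark.sql.warehouse.dir": "/tmp/wh"})
print(second.conf["spark.sql.warehouse.dir"])  # /tmp/wh
print(first is second)  # True
```

The buggy variant simply returns `_existing` untouched, which is why the second caller's Hive settings never reached the underlying context.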
[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292717#comment-15292717 ]

Reynold Xin commented on SPARK-15345:
-------------------------------------

Haven't tested it :) I am a little bit busy later today. Do you want to take a look at the Python part? Also cc [~andrewor14] who might be able to work on the Python part.
[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292715#comment-15292715 ]

Maciej Bryński commented on SPARK-15345:
----------------------------------------

[~rxin] Are you planning another PR for Python, as in your comment: "I updated Python docs. The Python change seems slightly larger and since it is not user facing, I'm going to defer it to another pr." Or should I assume that the Python part is working?
[jira] [Resolved] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-15345.
---------------------------------
    Resolution: Fixed
      Assignee: Reynold Xin
 Fix Version/s: 2.0.0

I think I have fixed it in https://github.com/apache/spark/pull/13200. If there is still a problem, please reopen. Thanks.
[jira] [Resolved] (SPARK-15075) Cleanup dependencies between SQLContext and SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-15075.
---------------------------------
    Resolution: Fixed
      Assignee: Reynold Xin

> Cleanup dependencies between SQLContext and SparkSession
> --------------------------------------------------------
>
>                 Key: SPARK-15075
>                 URL: https://issues.apache.org/jira/browse/SPARK-15075
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>             Fix For: 2.0.0
>
> We currently use SQLContext.getOrCreate in SparkSession.Builder. It should
> probably be the other way around, i.e. all the core logic goes in
> SparkSession, and SQLContext just calls that.
[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database
[ https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292706#comment-15292706 ]

Xiao Li commented on SPARK-15396:
---------------------------------

Based on your description, it sounds like the problem is caused by the parameter you used. If you can fix the issue by setting `spark.sql.warehouse.dir`, then this is a documentation issue. I will submit a PR soon. Thanks!

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> ------------------------------------------------------------
>
>                 Key: SPARK-15396
>                 URL: https://issues.apache.org/jira/browse/SPARK-15396
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Yi Zhou
>            Priority: Critical
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue:
> it always connects to the local derby database and can't connect to my existing
> hive metastore database. Could you help me check the root cause?
> What is the specific configuration for integrating with the hive metastore in Spark
> 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is an existing hive metastore database named "test_sparksql". I always
> got the error "metastore.ObjectStore: Failed to get database test_sparksql,
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please
> see the steps below for details.
>
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in > [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name > hive.enable.spark.execution.engine does not exist > 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with > implemenation class:org.apache.hadoop.hive.metastore.ObjectStore > 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called > 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) > "org.datanucleus.store.rdbms" is already registered. Ensure you dont have > multiple JAR versions of the same plugin in the classpath. The URL > "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, > and you are trying to register an identical plugin located at URL > "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar." > 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" > is already registered. Ensure you dont have multiple JAR versions of the same > plugin in the classpath. The URL > "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, > and you are trying to register an identical plugin located at URL > "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar." > 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) > "org.datanucleus.api.jdo" is already registered. Ensure you dont have > multiple JAR versions of the same plugin in the classpath. 
The URL > "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already > registered, and you are trying to register an identical plugin located at URL > "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar." > 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property > datanucleus.cache.level2 unknown - will be ignored > 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property > hive.metastore.integral.jdo.pushdown unknown - will be ignored > 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name > hive.enable.spark.execution.engine does not exist > 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin > classes with > hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order" > 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does
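Xiao Li's comment above suggests the workaround is to set the warehouse directory explicitly. A hedged sketch of what such an entry in spark-defaults.conf might look like — the path is purely illustrative, and whether this applies depends on your deployment:

```
# Illustrative only: point the Spark 2.0 warehouse at the shared Hive
# warehouse location instead of the local default.
spark.sql.warehouse.dir    hdfs:///user/hive/warehouse
```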
[jira] [Updated] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs
[ https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-11827:
--------------------------------
    Assignee: kevin yu

> Support java.math.BigInteger in Type-Inference utilities for POJOs
> ------------------------------------------------------------------
>
>                 Key: SPARK-11827
>                 URL: https://issues.apache.org/jira/browse/SPARK-11827
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Abhilash Srimat Tirumala Pallerlamudi
>            Assignee: kevin yu
>            Priority: Minor
>             Fix For: 2.0.0
>
> I get the below exception when creating a DataFrame using an RDD of JavaBeans
> having a property of type java.math.BigInteger:
> scala.MatchError: class java.math.BigInteger (of class java.lang.Class)
>         at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182)
>         at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>         at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181)
>         at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419)
>         at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447)
> I don't see support for java.math.BigInteger in
> org.apache.spark.sql.catalyst.JavaTypeInference.scala
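The MatchError above is the classic failure mode of schema inference keyed on a fixed set of field types: any type without a case blows up. As a toy model only (plain Python standing in for Spark's JavaTypeInference, with `decimal.Decimal` standing in for java.math.BigInteger; the type names are illustrative), the fix is simply to add the missing case:

```python
# Toy model of the SPARK-11827 failure mode: a type-inference table that has
# no entry for an arbitrary-precision integer raises, and the fix adds one.
import decimal

# Maps a Python field type to a Catalyst-like type name (illustrative names).
TYPE_MAP = {
    int: "IntegerType",
    float: "DoubleType",
    str: "StringType",
}

def infer(value):
    try:
        return TYPE_MAP[type(value)]
    except KeyError:
        # Analogue of the scala.MatchError for java.math.BigInteger.
        raise TypeError(f"no schema mapping for {type(value).__name__}")

# The fix: register the previously unhandled type. A decimal wide enough to
# hold any BigInteger plays the role of Spark's DecimalType here.
TYPE_MAP[decimal.Decimal] = "DecimalType(38, 0)"

print(infer(decimal.Decimal("12345678901234567890")))  # DecimalType(38, 0)
```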
[jira] [Resolved] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs
[ https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-11827.
---------------------------------
    Resolution: Fixed
 Fix Version/s: 2.0.0

Issue resolved by pull request 10125
[https://github.com/apache/spark/pull/10125]
[jira] [Commented] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively
[ https://issues.apache.org/jira/browse/SPARK-15431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292701#comment-15292701 ]

Apache Spark commented on SPARK-15431:
--------------------------------------

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/13212

> Support LIST FILE(s)|JAR(s) command natively
> --------------------------------------------
>
>                 Key: SPARK-15431
>                 URL: https://issues.apache.org/jira/browse/SPARK-15431
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Xin Wu
>
> Currently the command "ADD FILE|JAR" is supported natively in
> SparkSQL. However, when this command is run, the file/jar is added to resources
> that cannot be looked up with a "LIST FILE(s)|JAR(s)" command, because
> the LIST command is passed to the Hive command processor in Spark-SQL, or is
> simply not supported in Spark-shell. There is no way for users to find out
> which files/jars have been added to the spark context.
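The gap described here is that ADD FILE records a resource but nothing exposes the records, so the registry is effectively write-only. A minimal sketch of the idea (plain Python, not Spark's actual resource handling; all names are invented for the sketch) — track additions so they can be enumerated:

```python
# Toy sketch of native LIST FILE(s)/JAR(s) support: keep the resources added
# by ADD FILE / ADD JAR in an ordered, deduplicated registry so a LIST
# command can enumerate them later.

class ResourceRegistry:
    def __init__(self):
        self._files = []
        self._jars = []

    def add_file(self, path):
        if path not in self._files:  # adding the same path twice is a no-op
            self._files.append(path)

    def add_jar(self, path):
        if path not in self._jars:
            self._jars.append(path)

    def list_files(self):
        return list(self._files)

    def list_jars(self):
        return list(self._jars)

reg = ResourceRegistry()
reg.add_file("/tmp/lookup.csv")   # analogue of: ADD FILE /tmp/lookup.csv
reg.add_jar("/tmp/udfs.jar")      # analogue of: ADD JAR /tmp/udfs.jar
print(reg.list_files())           # analogue of: LIST FILES
print(reg.list_jars())            # analogue of: LIST JARS
```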
[jira] [Assigned] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively
[ https://issues.apache.org/jira/browse/SPARK-15431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15431:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively
[ https://issues.apache.org/jira/browse/SPARK-15431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15431:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292689#comment-15292689 ]

Miao Wang commented on SPARK-15363:
-----------------------------------

[~mengxr] Yanbo pointed me to https://github.com/apache/spark/pull/13202. Now I understand using the public API in this PR in the example code. Thanks!
[jira] [Created] (SPARK-15432) Two executors with same id in Spark UI
Davies Liu created SPARK-15432: -- Summary: Two executors with same id in Spark UI Key: SPARK-15432 URL: https://issues.apache.org/jira/browse/SPARK-15432 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Davies Liu Both of them are dead. {code} 56 10.0.245.96:50929 Dead 0 0.0 B / 15.3 GB 0.0 B 4 0 0 0 0 0 ms (0 ms) 0.0 B 0.0 B 0.0 B stdout stderr 56 10.0.245.96:50929 Dead 0 0.0 B / 15.3 GB 0.0 B 4 0 0 0 0 0 ms (0 ms) 0.0 B 0.0 B 0.0 B stdout stderr {code}
[jira] [Resolved] (SPARK-15321) Encoding/decoding of Array[Timestamp] fails
[ https://issues.apache.org/jira/browse/SPARK-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15321. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13108 [https://github.com/apache/spark/pull/13108] > Encoding/decoding of Array[Timestamp] fails > --- > > Key: SPARK-15321 > URL: https://issues.apache.org/jira/browse/SPARK-15321 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sumedh Mungee >Assignee: Sumedh Mungee > Fix For: 2.0.0 > > > In {{ExpressionEncoderSuite}}, if you add the following test case: > {code} > encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of > timestamp") > {code} > ... you will see that (without this fix) it fails with the following output: > {code} > - encode/decode for array of timestamp: [Ljava.sql.Timestamp;@fd9ebde *** > FAILED *** > Exception thrown while decoding > Converted: [0,100010,80001,52a7ccdc36800] > Schema: value#61615 > root > -- value: array (nullable = true) > |-- element: timestamp (containsNull = true) > Encoder: > class[value[0]: array] (ExpressionEncoderSuite.scala:312) > {code}
[jira] [Created] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively
Xin Wu created SPARK-15431: -- Summary: Support LIST FILE(s)|JAR(s) command natively Key: SPARK-15431 URL: https://issues.apache.org/jira/browse/SPARK-15431 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xin Wu Currently the "ADD FILE|JAR" command is supported natively in Spark SQL. However, when this command is run, the file/jar is added to the resources but cannot be looked up by the "LIST FILE(s)|JAR(s)" command, because the LIST command is passed to the Hive command processor in Spark-SQL, or is simply not supported in Spark-shell. There is no way for users to find out what files/jars have been added to the Spark context.
[jira] [Assigned] (SPARK-15430) Access ListAccumulator's value could possibly cause java.util.ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15430: Assignee: Apache Spark > Access ListAccumulator's value could possibly cause > java.util.ConcurrentModificationException > - > > Key: SPARK-15430 > URL: https://issues.apache.org/jira/browse/SPARK-15430 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > In ListAccumulator we create an unmodifiable view of the underlying list. > However, this doesn't prevent the underlying list from being modified further. So > while we access the unmodifiable view, the underlying list can be modified at the > same time, which can cause java.util.ConcurrentModificationException. We should > fix it.
[jira] [Commented] (SPARK-15430) Access ListAccumulator's value could possibly cause java.util.ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292682#comment-15292682 ] Apache Spark commented on SPARK-15430: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/13211 > Access ListAccumulator's value could possibly cause > java.util.ConcurrentModificationException > - > > Key: SPARK-15430 > URL: https://issues.apache.org/jira/browse/SPARK-15430 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > In ListAccumulator we create an unmodifiable view of the underlying list. > However, this doesn't prevent the underlying list from being modified further. So > while we access the unmodifiable view, the underlying list can be modified at the > same time, which can cause java.util.ConcurrentModificationException. We should > fix it.
[jira] [Assigned] (SPARK-15430) Access ListAccumulator's value could possibly cause java.util.ConcurrentModificationException
[ https://issues.apache.org/jira/browse/SPARK-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15430: Assignee: (was: Apache Spark) > Access ListAccumulator's value could possibly cause > java.util.ConcurrentModificationException > - > > Key: SPARK-15430 > URL: https://issues.apache.org/jira/browse/SPARK-15430 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > In ListAccumulator we create an unmodifiable view of the underlying list. > However, this doesn't prevent the underlying list from being modified further. So > while we access the unmodifiable view, the underlying list can be modified at the > same time, which can cause java.util.ConcurrentModificationException. We should > fix it.
[jira] [Created] (SPARK-15430) Access ListAccumulator's value could possibly cause java.util.ConcurrentModificationException
Liang-Chi Hsieh created SPARK-15430: --- Summary: Access ListAccumulator's value could possibly cause java.util.ConcurrentModificationException Key: SPARK-15430 URL: https://issues.apache.org/jira/browse/SPARK-15430 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh In ListAccumulator we create an unmodifiable view of the underlying list. However, this doesn't prevent the underlying list from being modified further. So while we access the unmodifiable view, the underlying list can be modified at the same time, which can cause java.util.ConcurrentModificationException. We should fix it.
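The failure mode described in SPARK-15430 is easy to reproduce with plain `java.util`, outside Spark: `Collections.unmodifiableList` only blocks mutation *through the view*, while the backing list's fail-fast iterator still detects changes made directly to the backing list. A minimal sketch (the class and method names below are illustrative, not Spark's actual `ListAccumulator`):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

public class UnmodifiableViewDemo {
    // Returns true if iterating the unmodifiable view fails once the
    // backing list is mutated mid-iteration.
    public static boolean viewIterationFails() {
        List<Integer> backing = new ArrayList<>(List.of(1, 2, 3));
        List<Integer> view = Collections.unmodifiableList(backing);
        Iterator<Integer> it = view.iterator();
        try {
            it.next();      // start iterating the view
            backing.add(4); // a concurrent writer mutates the backing list
            it.next();      // fail-fast iterator detects the modification
            return false;
        } catch (ConcurrentModificationException e) {
            return true;    // exactly the exception reported in the issue
        }
    }

    public static void main(String[] args) {
        System.out.println("CME observed: " + viewIterationFails());
    }
}
```

A fix therefore has to do more than wrap the list; something along the lines of handing out a copy (or synchronizing reads with writes) is needed so readers never share an iterator with concurrent writers.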
[jira] [Created] (SPARK-15429) When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate the receiving rate accurately.
Albert Cheng created SPARK-15429: Summary: When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate the receiving rate accurately. Key: SPARK-15429 URL: https://issues.apache.org/jira/browse/SPARK-15429 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.6.1 Reporter: Albert Cheng When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate the receiving rate accurately. For example, if the batch duration is set to 10 seconds, each RDD in the DStream will take 20s to compute. Even after setting `spark.streaming.concurrentJobs=2`, each RDD in the DStream still takes 20s to consume its data, which leads to a poor backpressure estimate by PIDRateEstimator.
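To see why concurrency skews the estimate, compare the rate visible from a single batch's processing delay with the pipeline's real throughput. The sketch below is illustrative arithmetic only: the batch interval and processing delay come from the report, the batch size is hypothetical, and it does not reproduce PIDRateEstimator's actual PID coefficients.

```java
// Simplified back-of-the-envelope model of the SPARK-15429 scenario:
// 10s batch interval, 20s processing time per RDD, concurrentJobs = 2.
public class BackpressureSketch {
    static final double BATCH_INTERVAL_SEC = 10.0;
    static final double PROCESSING_DELAY_SEC = 20.0;
    static final long RECORDS_PER_BATCH = 1000; // hypothetical batch size

    // Rate an estimator derives from one batch's processing delay.
    static double perBatchRate() {
        return RECORDS_PER_BATCH / PROCESSING_DELAY_SEC; // 50 records/s
    }

    // With two overlapping jobs, the pipeline still completes one batch
    // per interval, so real throughput is records per interval.
    static double actualThroughput() {
        return RECORDS_PER_BATCH / BATCH_INTERVAL_SEC; // 100 records/s
    }

    public static void main(String[] args) {
        System.out.println("per-batch view: " + perBatchRate()
                + " rec/s, actual throughput: " + actualThroughput() + " rec/s");
    }
}
```

The per-batch view undershoots real capacity by the concurrency factor, so an estimator fed only per-batch delays will throttle the receiving rate more aggressively than necessary.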
[jira] [Updated] (SPARK-15308) RowEncoder should preserve nested column name.
[ https://issues.apache.org/jira/browse/SPARK-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15308: - Target Version/s: 2.0.0 > RowEncoder should preserve nested column name. > -- > > Key: SPARK-15308 > URL: https://issues.apache.org/jira/browse/SPARK-15308 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > The following code generates a wrong schema: > {code} > val schema = new StructType().add( > "struct", > new StructType() > .add("i", IntegerType, nullable = false) > .add( > "s", > new StructType().add("int", IntegerType, nullable = false), > nullable = false), > nullable = false) > val ds = sqlContext.range(10).map(l => Row(l, Row(l)))(RowEncoder(schema)) > ds.printSchema() > {code} > This should print as follows: > {code} > root > |-- struct: struct (nullable = false) > ||-- i: integer (nullable = false) > ||-- s: struct (nullable = false) > |||-- int: integer (nullable = false) > {code} > but the result is: > {code} > root > |-- struct: struct (nullable = false) > ||-- col1: integer (nullable = false) > ||-- col2: struct (nullable = false) > |||-- col1: integer (nullable = false) > {code}
[jira] [Updated] (SPARK-15313) EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject.
[ https://issues.apache.org/jira/browse/SPARK-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15313: - Target Version/s: 2.0.0 > EmbedSerializerInFilter rule should keep exprIds of output of surrounded > SerializeFromObject. > - > > Key: SPARK-15313 > URL: https://issues.apache.org/jira/browse/SPARK-15313 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > The following code: > {code} > val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS() > ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_)) > {code} > throws an Exception: > {noformat} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: _1#420 > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:254) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at 
scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79) > at > org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:113) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:79) > at > org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) >
[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database
[ https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292672#comment-15292672 ] Yi Zhou commented on SPARK-15396: - It seems this is not only a doc issue; it may be a functional issue. > [Spark] [SQL] [DOC] It can't connect hive metastore database > > > Key: SPARK-15396 > URL: https://issues.apache.org/jira/browse/SPARK-15396 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yi Zhou >Priority: Critical > > I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master > code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue: > it always connects to a local Derby database and can't connect to my existing > hive metastore database. Could you help me check what the root cause is? > What's the specific configuration for integrating with the hive metastore in Spark > 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance! > Build package command: > ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 > -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests > Key configurations in spark-defaults.conf: > {code} > spark.sql.hive.metastore.version=1.1.0 > spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/* > spark.executor.extraClassPath=/etc/hive/conf > spark.driver.extraClassPath=/etc/hive/conf > spark.yarn.jars=local:/usr/lib/spark/jars/* > {code} > There is an existing hive metastore database named "test_sparksql". I always > got the error "metastore.ObjectStore: Failed to get database test_sparksql, > returning NoSuchObjectException" after issuing 'use test_sparksql'. Please > see the steps below for details. > > $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client > SLF4J: Class path contains multiple SLF4J bindings. 
> SLF4J: Found binding in > [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name > hive.enable.spark.execution.engine does not exist > 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with > implemenation class:org.apache.hadoop.hive.metastore.ObjectStore > 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called > 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) > "org.datanucleus.store.rdbms" is already registered. Ensure you dont have > multiple JAR versions of the same plugin in the classpath. The URL > "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, > and you are trying to register an identical plugin located at URL > "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar." > 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" > is already registered. Ensure you dont have multiple JAR versions of the same > plugin in the classpath. The URL > "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, > and you are trying to register an identical plugin located at URL > "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar." > 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) > "org.datanucleus.api.jdo" is already registered. Ensure you dont have > multiple JAR versions of the same plugin in the classpath. 
The URL > "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already > registered, and you are trying to register an identical plugin located at URL > "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar." > 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property > datanucleus.cache.level2 unknown - will be ignored > 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property > hive.metastore.integral.jdo.pushdown unknown - will be ignored > 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name > hive.enable.spark.execution.engine does not exist > 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin > classes with > hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order" > 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged
[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database
[ https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292655#comment-15292655 ] Yi Zhou commented on SPARK-15396: - Hi [~rxin], I saw a bug fix: https://issues.apache.org/jira/browse/SPARK-15345. Does it also fix the issue in this JIRA? > [Spark] [SQL] [DOC] It can't connect hive metastore database > > > Key: SPARK-15396 > URL: https://issues.apache.org/jira/browse/SPARK-15396 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yi Zhou >Priority: Critical > > I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master > code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e) but ran across an issue: > it always connects to a local Derby database and can't connect to my existing > hive metastore database. Could you help me check what the root cause is? > What's the specific configuration for integrating with the hive metastore in Spark > 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance! > Build package command: > ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 > -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests > Key configurations in spark-defaults.conf: > {code} > spark.sql.hive.metastore.version=1.1.0 > spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/* > spark.executor.extraClassPath=/etc/hive/conf > spark.driver.extraClassPath=/etc/hive/conf > spark.yarn.jars=local:/usr/lib/spark/jars/* > {code} > There is an existing hive metastore database named "test_sparksql". I always > got the error "metastore.ObjectStore: Failed to get database test_sparksql, > returning NoSuchObjectException" after issuing 'use test_sparksql'. Please > see the steps below for details. > > $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client > SLF4J: Class path contains multiple SLF4J bindings. 
> SLF4J: Found binding in > [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name > hive.enable.spark.execution.engine does not exist > 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with > implemenation class:org.apache.hadoop.hive.metastore.ObjectStore > 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called > 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) > "org.datanucleus.store.rdbms" is already registered. Ensure you dont have > multiple JAR versions of the same plugin in the classpath. The URL > "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, > and you are trying to register an identical plugin located at URL > "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar." > 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" > is already registered. Ensure you dont have multiple JAR versions of the same > plugin in the classpath. The URL > "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, > and you are trying to register an identical plugin located at URL > "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar." > 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) > "org.datanucleus.api.jdo" is already registered. Ensure you dont have > multiple JAR versions of the same plugin in the classpath. 
The URL > "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already > registered, and you are trying to register an identical plugin located at URL > "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar." > 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property > datanucleus.cache.level2 unknown - will be ignored > 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property > hive.metastore.integral.jdo.pushdown unknown - will be ignored > 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name > hive.enable.spark.execution.engine does not exist > 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin > classes with > hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order" > 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class >
[jira] [Resolved] (SPARK-15296) Refactor All Java Tests that use SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15296. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13101 [https://github.com/apache/spark/pull/13101] > Refactor All Java Tests that use SparkSession > - > > Key: SPARK-15296 > URL: https://issues.apache.org/jira/browse/SPARK-15296 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, Tests >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Minor > Fix For: 2.0.0 > > > There's a lot of duplicate code in the Java tests: the {{setUp()}} and {{tearDown()}} > of most Java test classes in ML/MLlib. So we will create a {{SharedSparkSession}} > class that holds the common {{setUp}} and {{tearDown}} code, and the other classes > just extend that class.
[jira] [Updated] (SPARK-15428) Disable support for multiple streaming aggregations
[ https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-15428: -- Target Version/s: 2.0.0 > Disable support for multiple streaming aggregations > --- > > Key: SPARK-15428 > URL: https://issues.apache.org/jira/browse/SPARK-15428 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Incrementalizing plans with multiple streaming aggregations is tricky, and > we don't have the necessary "delta" support to implement it correctly. So we are > disabling support for multiple streaming aggregations.
[jira] [Assigned] (SPARK-15428) Disable support for multiple streaming aggregations
[ https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15428: Assignee: Apache Spark (was: Tathagata Das) > Disable support for multiple streaming aggregations > --- > > Key: SPARK-15428 > URL: https://issues.apache.org/jira/browse/SPARK-15428 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > > Incrementalizing plans with multiple streaming aggregations is tricky, and > we don't have the necessary "delta" support to implement it correctly. So we are > disabling support for multiple streaming aggregations.
[jira] [Commented] (SPARK-15428) Disable support for multiple streaming aggregations
[ https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292629#comment-15292629 ] Apache Spark commented on SPARK-15428: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/13210 > Disable support for multiple streaming aggregations > --- > > Key: SPARK-15428 > URL: https://issues.apache.org/jira/browse/SPARK-15428 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Incrementalizing plans with multiple streaming aggregations is tricky, and > we don't have the necessary "delta" support to implement it correctly. So we are > disabling support for multiple streaming aggregations.
[jira] [Assigned] (SPARK-15428) Disable support for multiple streaming aggregations
[ https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15428: Assignee: Tathagata Das (was: Apache Spark) > Disable support for multiple streaming aggregations > --- > > Key: SPARK-15428 > URL: https://issues.apache.org/jira/browse/SPARK-15428 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Incrementalizing plans with multiple streaming aggregations is tricky, and > we don't have the necessary "delta" support to implement it correctly. So we are > disabling support for multiple streaming aggregations.
[jira] [Created] (SPARK-15428) Disable support for multiple streaming aggregations
Tathagata Das created SPARK-15428: - Summary: Disable support for multiple streaming aggregations Key: SPARK-15428 URL: https://issues.apache.org/jira/browse/SPARK-15428 Project: Spark Issue Type: Sub-task Reporter: Tathagata Das Assignee: Tathagata Das Incrementalizing plans with multiple streaming aggregations is tricky, and we don't have the necessary "delta" support to implement it correctly. So we are disabling support for multiple streaming aggregations.
[jira] [Updated] (SPARK-15427) Spark SQL doesn't support field case sensitive when load data use Phoenix
[ https://issues.apache.org/jira/browse/SPARK-15427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deng updated SPARK-15427: - Description: I use sparkSql load data from Apache Phoenix. SQLContext sqlContext = new SQLContext(sc); Mapoptions = new HashMap(); options.put("driver", driver); options.put("url", PhoenixUtil.p.getProperty("phoenixURL")); options.put("dbtable", "(select "value","name" from "user")"); DataFrame jdbcDF = sqlContext.load("jdbc", options); It always throws exception, like "can't find field VALUE". I tracked the code and found spark will use: val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0").executeQuery() to get the field.But the field already be uppercased, like "value" to VALUE. So it will always throws "can't find field VALUE"; It didn't think of the the case when data loaded from source in which filed is case sensitive. was: i am use sparkSql load data from Apache Phoenix. SQLContext sqlContext = new SQLContext(sc); Map options = new HashMap(); options.put("driver", driver); options.put("url", PhoenixUtil.p.getProperty("phoenixURL")); options.put("dbtable", "(select "value","name" from "user")"); DataFrame jdbcDF = sqlContext.load("jdbc", options); It will always throws exception, like "can't find field VALUE". I track the code and find spark will use: val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0").executeQuery() to get the field.But the field already be uppercase like "value" to VALUE. So it will always throws "can't find field VALUE"; It didn't think of the the case when data loaded from source in which filed is case sensitive. > Spark SQL doesn't support field case sensitive when load data use Phoenix > - > > Key: SPARK-15427 > URL: https://issues.apache.org/jira/browse/SPARK-15427 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: deng > Labels: easyfix, features, newbie > > I use sparkSql load data from Apache Phoenix. 
> SQLContext sqlContext = new SQLContext(sc); > Map<String, String> options = new HashMap<>(); > options.put("driver", driver); > options.put("url", PhoenixUtil.p.getProperty("phoenixURL")); > options.put("dbtable", "(select "value","name" from "user")"); > DataFrame jdbcDF = sqlContext.load("jdbc", options); > It always throws an exception like "can't find field VALUE". > I tracked the code and found that Spark uses: > val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE > 1=0").executeQuery() > to get the fields. But the field names have already been uppercased, e.g. "value" to VALUE, > so it always throws "can't find field VALUE". > It doesn't handle the case where data is loaded from a source whose fields > are case sensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
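A minimal sketch of the fix the report implies: when the exact (case-sensitive) field name is absent from the driver's metadata, fall back to a case-insensitive match, since Phoenix upper-cases unquoted identifiers. The class and method names here are hypothetical, not Spark's actual JDBC code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: resolve a requested field name against the column
// names reported by the JDBC driver.
public class FieldResolver {
    public static String resolve(String requested, List<String> reported) {
        if (reported.contains(requested)) {
            return requested;                     // exact (case-sensitive) hit
        }
        for (String col : reported) {
            if (col.equalsIgnoreCase(requested)) {
                return col;                       // case-insensitive fallback
            }
        }
        throw new IllegalArgumentException("can't find field " + requested);
    }

    public static void main(String[] args) {
        // Phoenix reports VALUE and NAME for the unquoted query above;
        // without the fallback, looking up "value" fails as described.
        List<String> reported = Arrays.asList("VALUE", "NAME");
        System.out.println(resolve("value", reported)); // prints VALUE
    }
}
```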
[jira] [Created] (SPARK-15427) Spark SQL doesn't support field case sensitive when load data use Phoenix
deng created SPARK-15427: Summary: Spark SQL doesn't support field case sensitive when load data use Phoenix Key: SPARK-15427 URL: https://issues.apache.org/jira/browse/SPARK-15427 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.5.0 Reporter: deng I use Spark SQL to load data from Apache Phoenix. SQLContext sqlContext = new SQLContext(sc); Map<String, String> options = new HashMap<>(); options.put("driver", driver); options.put("url", PhoenixUtil.p.getProperty("phoenixURL")); options.put("dbtable", "(select "value","name" from "user")"); DataFrame jdbcDF = sqlContext.load("jdbc", options); It always throws an exception like "can't find field VALUE". I tracked the code and found that Spark uses: val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0").executeQuery() to get the fields. But the field names have already been uppercased, e.g. "value" to VALUE, so it always throws "can't find field VALUE". It doesn't handle the case where data is loaded from a source whose fields are case sensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292599#comment-15292599 ] Nicholas Chammas commented on SPARK-3821: - You can deploy Spark today on Docker just fine. It's just that Spark itself does not maintain any official Dockerfiles and likely never will since the project is actually trying to push deployment stuff outside the main project (hence why spark-ec2 was moved out; you will not see spark-ec2 in the official docs once Spark 2.0 comes out). You may be more interested in the Apache Big Top project, which focuses on big data system deployment (including Spark) and may have stuff for Docker specifically. Mesos is a separate matter, because it's a resource manager (analogous to YARN) that integrates with Spark at a low level. If you still think Spark should host and maintain an official Dockerfile and Docker images that are suitable for production use, please open a separate issue. I think the maintainers will reject it on the grounds that I have explained here, though. (Can't say for sure; after all I'm just a random contributor.) > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15425) Disallow cartesian joins by default
[ https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15425: Assignee: Apache Spark > Disallow cartesian joins by default > --- > > Key: SPARK-15425 > URL: https://issues.apache.org/jira/browse/SPARK-15425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > It is fairly easy for users to shoot themselves in the foot if they run > cartesian joins. Often they might not even be aware of the join methods > chosen. This happened to me a few times in the last few weeks. > It would be a good idea to disable cartesian joins by default, and require > explicit enabling of it via "crossJoin" method or in SQL "cross join". This > however might be too large of a scope for 2.0 given the timing. As a small > and quick fix, we can just have a single config option > (spark.sql.join.enableCartesian) that controls this behavior. In the future > we can implement the fine-grained control. > Note that the error message should be friendly and say "Set > spark.sql.join.enableCartesian to true to turn on cartesian joins." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15425) Disallow cartesian joins by default
[ https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292558#comment-15292558 ] Apache Spark commented on SPARK-15425: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/13209 > Disallow cartesian joins by default > --- > > Key: SPARK-15425 > URL: https://issues.apache.org/jira/browse/SPARK-15425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > It is fairly easy for users to shoot themselves in the foot if they run > cartesian joins. Often they might not even be aware of the join methods > chosen. This happened to me a few times in the last few weeks. > It would be a good idea to disable cartesian joins by default, and require > explicit enabling of it via "crossJoin" method or in SQL "cross join". This > however might be too large of a scope for 2.0 given the timing. As a small > and quick fix, we can just have a single config option > (spark.sql.join.enableCartesian) that controls this behavior. In the future > we can implement the fine-grained control. > Note that the error message should be friendly and say "Set > spark.sql.join.enableCartesian to true to turn on cartesian joins." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15425) Disallow cartesian joins by default
[ https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15425: Assignee: (was: Apache Spark) > Disallow cartesian joins by default > --- > > Key: SPARK-15425 > URL: https://issues.apache.org/jira/browse/SPARK-15425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > It is fairly easy for users to shoot themselves in the foot if they run > cartesian joins. Often they might not even be aware of the join methods > chosen. This happened to me a few times in the last few weeks. > It would be a good idea to disable cartesian joins by default, and require > explicit enabling of it via "crossJoin" method or in SQL "cross join". This > however might be too large of a scope for 2.0 given the timing. As a small > and quick fix, we can just have a single config option > (spark.sql.join.enableCartesian) that controls this behavior. In the future > we can implement the fine-grained control. > Note that the error message should be friendly and say "Set > spark.sql.join.enableCartesian to true to turn on cartesian joins." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15420) Repartition and sort before Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15420: Target Version/s: 2.1.0 > Repartition and sort before Parquet writes > -- > > Key: SPARK-15420 > URL: https://issues.apache.org/jira/browse/SPARK-15420 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ryan Blue > > Parquet requires buffering data in memory before writing a group of rows > organized by column. This causes significant memory pressure when writing > partitioned output because each open file must buffer rows. > Currently, Spark will sort data and spill if necessary in the > {{WriterContainer}} to avoid keeping many files open at once. But, this isn't > a full solution for a few reasons: > * The final sort is always performed, even if incoming data is already sorted > correctly. For example, a global sort will cause two sorts to happen, even if > the global sort correctly prepares the data. > * To prevent a large number of small output files, users must manually > add a repartition step. That step is also ignored by the sort within the > writer. > * Hive does not currently support {{DataFrameWriter#sortBy}} > The sort in {{WriterContainer}} makes sense to prevent problems, but should > detect if the incoming data is already sorted. The {{DataFrameWriter}} should > also expose the ability to repartition data before the write stage, and the > query planner should expose an option to automatically insert repartition > operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
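The "detect if the incoming data is already sorted" idea from the ticket can be sketched with a toy key list; the class and method names are illustrative, not Spark's actual WriterContainer code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortIfNeeded {
    // Returns true when keys are already non-decreasing.
    public static boolean isSortedByKey(List<Integer> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i - 1) > keys.get(i)) return false;
        }
        return true;
    }

    // Sort only when necessary, so input that a global sort already
    // prepared correctly does not pay for a second, redundant sort.
    public static List<Integer> sortIfNeeded(List<Integer> keys) {
        if (isSortedByKey(keys)) return keys;  // skip the redundant sort
        List<Integer> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortIfNeeded(Arrays.asList(3, 1, 2))); // prints [1, 2, 3]
        List<Integer> sorted = Arrays.asList(1, 2, 3);
        // Same list instance comes back: no work was done.
        System.out.println(sortIfNeeded(sorted) == sorted);       // prints true
    }
}
```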
[jira] [Updated] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14543: Target Version/s: 2.0.0 > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue > > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). > A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copy > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem.
It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
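The positional-zip hazard and the proposed by-name matching can be illustrated with a simplified row model; the class, method names, and row representation here are hypothetical, not Spark's pre-insertion cast code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnMatching {
    // Positional matching: zip destination columns with source values.
    // Extra source columns are silently dropped, and a reordered destination
    // schema silently misroutes data -- the failure mode described above.
    public static Map<String, Object> byPosition(List<String> dstCols, List<Object> srcRow) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < Math.min(dstCols.size(), srcRow.size()); i++) {
            out.put(dstCols.get(i), srcRow.get(i));
        }
        return out;
    }

    // By-name matching: order-independent, and a missing column fails loudly
    // instead of masking the problem with a cast.
    public static Map<String, Object> byName(List<String> dstCols, Map<String, Object> srcRow) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String col : dstCols) {
            if (!srcRow.containsKey(col)) {
                throw new IllegalArgumentException("missing column: " + col);
            }
            out.put(col, srcRow.get(col));
        }
        return out;
    }

    public static void main(String[] args) {
        // src (id, count, total) copied into dst (id, total, count), as in
        // the ticket's example:
        List<String> dst = Arrays.asList("id", "total", "count");
        Map<String, Object> row = byPosition(dst, Arrays.asList(1L, 5, 100L));
        System.out.println(row.get("total")); // prints 5 -- count's value, silently misplaced
    }
}
```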
[jira] [Updated] (SPARK-15296) Refactor All Java Tests that use SparkSession
[ https://issues.apache.org/jira/browse/SPARK-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15296: Assignee: Sandeep Singh Target Version/s: 2.0.0 > Refactor All Java Tests that use SparkSession > - > > Key: SPARK-15296 > URL: https://issues.apache.org/jira/browse/SPARK-15296 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, Tests >Reporter: Sandeep Singh >Assignee: Sandeep Singh >Priority: Minor > > There's a lot of duplicate code in the {{setUp()}} and {{tearDown()}} methods > of most Java test classes in ML/MLlib. > So we will create a {{SharedSparkSession}} class that holds the common > {{setUp}} and {{tearDown}} code, and have the other test classes extend it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14261: Target Version/s: 2.0.0 > Memory leak in Spark Thrift Server > -- > > Key: SPARK-14261 > URL: https://issues.apache.org/jira/browse/SPARK-14261 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Xiaochun Liang > Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, > 8892_4g_objects.PNG, 8892_5g_objects.PNG, 8892_6g_objects.PNG, > 8892_6g_stop_longrunquery_objects.PNG, 8892_MemorySnapshot.PNG, > MemorySnapshot.PNG > > > I am running Spark Thrift server on Windows Server 2012. The Spark Thrift > server is launched in YARN client mode. Its memory usage increases gradually > as queries come in. I suspect there is a memory leak in the Spark Thrift > server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14261: Target Version/s: 1.6.2, 2.0.0 (was: 2.0.0) > Memory leak in Spark Thrift Server > -- > > Key: SPARK-14261 > URL: https://issues.apache.org/jira/browse/SPARK-14261 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Xiaochun Liang > Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, > 8892_4g_objects.PNG, 8892_5g_objects.PNG, 8892_6g_objects.PNG, > 8892_6g_stop_longrunquery_objects.PNG, 8892_MemorySnapshot.PNG, > MemorySnapshot.PNG > > > I am running Spark Thrift server on Windows Server 2012. The Spark Thrift > server is launched in YARN client mode. Its memory usage increases gradually > as queries come in. I suspect there is a memory leak in the Spark Thrift > server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15321) Encoding/decoding of Array[Timestamp] fails
[ https://issues.apache.org/jira/browse/SPARK-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15321: Assignee: Sumedh Mungee Affects Version/s: (was: 2.0.0) Target Version/s: 2.0.0 Component/s: (was: Spark Core) SQL > Encoding/decoding of Array[Timestamp] fails > --- > > Key: SPARK-15321 > URL: https://issues.apache.org/jira/browse/SPARK-15321 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sumedh Mungee >Assignee: Sumedh Mungee > > In {{ExpressionEncoderSuite}}, if you add the following test case: > {code} > encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of > timestamp") > {code} > ... you will see that (without this fix) it fails with the following output: > {code} > - encode/decode for array of timestamp: [Ljava.sql.Timestamp;@fd9ebde *** > FAILED *** > Exception thrown while decoding > Converted: [0,100010,80001,52a7ccdc36800] > Schema: value#61615 > root > -- value: array (nullable = true) > |-- element: timestamp (containsNull = true) > Encoder: > class[value[0]: array] (ExpressionEncoderSuite.scala:312) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14990) nvl, coalesce, array functions with parameter of type "array"
[ https://issues.apache.org/jira/browse/SPARK-14990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292524#comment-15292524 ] Apache Spark commented on SPARK-14990: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13208 > nvl, coalesce, array functions with parameter of type "array" > - > > Key: SPARK-14990 > URL: https://issues.apache.org/jira/browse/SPARK-14990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Oleg Danilov >Priority: Minor > > Steps to reproduce: > 1. create table tmp(col1 int, col2 array) > 2. insert values: > {code} > 1, [1] > 2, [2] > 3, NULL > {code} > 3. run query > select col1, coalesce(col2, array(1,2,3)) from tmp; > Expected result: > {code} > 1, [1] > 2, [2] > 3, [1,2,3] > {code} > Current result: > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve > 'coalesce(col2,array(1,2,3))' due to data type mismatch: input to function > coalesce should all be the same type, but it's [array, array]; line > 1 pos 38 (state=,code=0) > {code} > The fix seems to be pretty easy, will create a PR soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
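For reference, the runtime semantics of coalesce are simply "return the first non-null argument"; the bug above is in the analyzer's type check for array arguments, not in this logic. A generic Java sketch (the class name is illustrative):

```java
public class Coalesce {
    // Return the first non-null argument, or null if all are null.
    @SafeVarargs
    public static <T> T coalesce(T... args) {
        for (T a : args) {
            if (a != null) {
                return a;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Integer[] fallback = {1, 2, 3};
        // Mirrors row 3 of the example: a NULL col2 falls back to array(1,2,3).
        System.out.println(coalesce(null, fallback) == fallback); // prints true
    }
}
```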
[jira] [Updated] (SPARK-15416) Display a better message for not finding classes removed in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-15416: - Assignee: Shixiong Zhu > Display a better message for not finding classes removed in Spark 2.0 > - > > Key: SPARK-15416 > URL: https://issues.apache.org/jira/browse/SPARK-15416 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > We removed some classes in Spark 2.0. If the user uses an incompatible > library, they may see a ClassNotFoundException. It's better to show a message > instructing them to use a compatible version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15416) Display a better message for not finding classes removed in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-15416. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13201 [https://github.com/apache/spark/pull/13201] > Display a better message for not finding classes removed in Spark 2.0 > - > > Key: SPARK-15416 > URL: https://issues.apache.org/jira/browse/SPARK-15416 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Shixiong Zhu > Fix For: 2.0.0 > > > We removed some classes in Spark 2.0. If the user uses an incompatible > library, they may see a ClassNotFoundException. It's better to show a message > instructing them to use a compatible version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
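The friendlier failure the ticket asks for amounts to wrapping the ClassNotFoundException with guidance. A sketch; the helper class and message wording are illustrative, not Spark's actual implementation:

```java
public class FriendlyLoader {
    // Wrap class-loading failures with an instruction, so a user linking an
    // incompatible library sees guidance instead of a bare missing-class name.
    public static Class<?> load(String name) throws ClassNotFoundException {
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new ClassNotFoundException(name
                + " not found. This class may have been removed in Spark 2.0;"
                + " please check that your library is built for Spark 2.0.", e);
        }
    }
}
```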
[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys
[ https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12978: Target Version/s: 2.0.0 > Skip unnecessary final group-by when input data already clustered with > group-by keys > > > Key: SPARK-12978 > URL: https://issues.apache.org/jira/browse/SPARK-12978 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro > > This ticket targets the optimization to skip an unnecessary group-by > operation below; > Without opt.: > {code} > == Physical Plan == > TungstenAggregate(key=[col0#159], > functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], > output=[col0#159,sum(col1)#177,avg(col2)#178]) > +- TungstenAggregate(key=[col0#159], > functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], > output=[col0#159,sum#200,sum#201,count#202L]) >+- TungstenExchange hashpartitioning(col0#159,200), None > +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], > InMemoryRelation [col0#159,col1#160,col2#161], true, 1, > StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None > {code} > With opt.: > {code} > == Physical Plan == > TungstenAggregate(key=[col0#159], > functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], > output=[col0#159,sum(col1)#177,avg(col2)#178]) > +- TungstenExchange hashpartitioning(col0#159,200), None > +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation > [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, > true, 1), ConvertToUnsafe, None > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
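The two plans above can be mirrored in a toy two-phase sum: a Partial per-partition pass, then a Final merge after the exchange. If each key already lives in exactly one partition (the input is clustered on the group-by key), every partial map is already final and the merge step is the redundant group-by this ticket wants to skip. Class and method names are illustrative only.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoPhaseAgg {
    // Partial aggregation: sum values within one partition.
    public static Map<String, Long> partial(List<Map.Entry<String, Long>> partition) {
        Map<String, Long> acc = new HashMap<>();
        for (Map.Entry<String, Long> row : partition) {
            acc.merge(row.getKey(), row.getValue(), Long::sum);
        }
        return acc;
    }

    // Final aggregation: merge partial results shuffled to the same reducer.
    public static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> acc = new HashMap<>();
        for (Map<String, Long> p : partials) {
            p.forEach((k, v) -> acc.merge(k, v, Long::sum));
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> part = Arrays.asList(
            new AbstractMap.SimpleEntry<>("a", 1L),
            new AbstractMap.SimpleEntry<>("a", 2L),
            new AbstractMap.SimpleEntry<>("b", 3L));
        // Sums per key within the partition: a -> 3, b -> 3.
        System.out.println(partial(part));
    }
}
```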
[jira] [Updated] (SPARK-15425) Disallow cartesian joins by default
[ https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15425: Issue Type: Sub-task (was: Improvement) Parent: SPARK-15426 > Disallow cartesian joins by default > --- > > Key: SPARK-15425 > URL: https://issues.apache.org/jira/browse/SPARK-15425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > It is fairly easy for users to shoot themselves in the foot if they run > cartesian joins. Often they might not even be aware of the join methods > chosen. This happened to me a few times in the last few weeks. > It would be a good idea to disable cartesian joins by default, and require > explicit enabling of it via "crossJoin" method or in SQL "cross join". This > however might be too large of a scope for 2.0 given the timing. As a small > and quick fix, we can just have a single config option > (spark.sql.join.enableCartesian) that controls this behavior. In the future > we can implement the fine-grained control. > Note that the error message should be friendly and say "Set > spark.sql.join.enableCartesian to true to turn on cartesian joins." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15426) Spark 2.0 SQL API audit
Reynold Xin created SPARK-15426: --- Summary: Spark 2.0 SQL API audit Key: SPARK-15426 URL: https://issues.apache.org/jira/browse/SPARK-15426 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin This is an umbrella ticket to list issues I found with APIs for the 2.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15425) Disallow cartesian joins by default
[ https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15425: Description: It is fairly easy for users to shoot themselves in the foot if they run cartesian joins. Often they might not even be aware of the join methods chosen. This happened to me a few times in the last few weeks. It would be a good idea to disable cartesian joins by default, and require explicit enabling of it via "crossJoin" method or in SQL "cross join". This however might be too large of a scope for 2.0 given the timing. As a small and quick fix, we can just have a single config option (spark.sql.join.enableCartesian) that controls this behavior. In the future we can implement the fine-grained control. Note that the error message should be friendly and say "Set spark.sql.join.enableCartesian to true to turn on cartesian joins." was: It is fairly easy for users to shoot themselves in the foot if they run cartesian joins. Often they might not even be aware of the join methods chosen. This happened to me a few times in the last few weeks. It would be a good idea to disable cartesian joins by default, and require explicit enabling of it via "crossJoin" method or in SQL "cross join". This however might be too large of a scope for 2.0 given the timing. As a small and quick fix, we can just have a single config option (spark.sql.join.enableCartesian) that controls this behavior. In the future we can implement the fine-grained control. > Disallow cartesian joins by default > --- > > Key: SPARK-15425 > URL: https://issues.apache.org/jira/browse/SPARK-15425 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > It is fairly easy for users to shoot themselves in the foot if they run > cartesian joins. Often they might not even be aware of the join methods > chosen. This happened to me a few times in the last few weeks. 
> It would be a good idea to disable cartesian joins by default, and require > explicit enabling of it via "crossJoin" method or in SQL "cross join". This > however might be too large of a scope for 2.0 given the timing. As a small > and quick fix, we can just have a single config option > (spark.sql.join.enableCartesian) that controls this behavior. In the future > we can implement the fine-grained control. > Note that the error message should be friendly and say "Set > spark.sql.join.enableCartesian to true to turn on cartesian joins." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15425) Disallow cartesian joins by default
[ https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15425: Target Version/s: 2.0.0 > Disallow cartesian joins by default > --- > > Key: SPARK-15425 > URL: https://issues.apache.org/jira/browse/SPARK-15425 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > It is fairly easy for users to shoot themselves in the foot if they run > cartesian joins. Often they might not even be aware of the join methods > chosen. This happened to me a few times in the last few weeks. > It would be a good idea to disable cartesian joins by default, and require > explicit enabling of it via "crossJoin" method or in SQL "cross join". This > however might be too large of a scope for 2.0 given the timing. As a small > and quick fix, we can just have a single config option > (spark.sql.join.enableCartesian) that controls this behavior. In the future > we can implement the fine-grained control. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15425) Disallow cartesian joins by default
Reynold Xin created SPARK-15425: --- Summary: Disallow cartesian joins by default Key: SPARK-15425 URL: https://issues.apache.org/jira/browse/SPARK-15425 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin It is fairly easy for users to shoot themselves in the foot if they run cartesian joins. Often they might not even be aware of the join methods chosen. This happened to me a few times in the last few weeks. It would be a good idea to disable cartesian joins by default, and require explicit enabling of it via "crossJoin" method or in SQL "cross join". This however might be too large of a scope for 2.0 given the timing. As a small and quick fix, we can just have a single config option (spark.sql.join.enableCartesian) that controls this behavior. In the future we can implement the fine-grained control. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
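The proposed config-gated guard could look roughly like this. The config key comes from the ticket; the class, method names, and error wording here are hypothetical, not Spark's planner code.

```java
import java.util.HashMap;
import java.util.Map;

public class CartesianGuard {
    private final Map<String, String> conf = new HashMap<>();

    public void set(String key, String value) {
        conf.put(key, value);
    }

    // Called where the planner would otherwise emit a cartesian product:
    // refuse unless the flag is explicitly enabled, with a friendly message.
    public void checkCartesianAllowed() {
        boolean enabled = Boolean.parseBoolean(
            conf.getOrDefault("spark.sql.join.enableCartesian", "false"));
        if (!enabled) {
            throw new IllegalStateException(
                "Set spark.sql.join.enableCartesian to true to turn on cartesian joins.");
        }
    }

    public static void main(String[] args) {
        CartesianGuard guard = new CartesianGuard();
        try {
            guard.checkCartesianAllowed();            // disabled by default
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        guard.set("spark.sql.join.enableCartesian", "true");
        guard.checkCartesianAllowed();                // now passes
    }
}
```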
[jira] [Resolved] (SPARK-15329) When start spark with yarn: spark.SparkContext: Error initializing SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-15329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15329. --- Resolution: Not A Problem > When start spark with yarn: spark.SparkContext: Error initializing > SparkContext. > -- > > Key: SPARK-15329 > URL: https://issues.apache.org/jira/browse/SPARK-15329 > Project: Spark > Issue Type: Bug > Components: EC2 >Reporter: Jon > > Hi, I'm trying to start Spark with yarn-client, like this: "spark-shell > --master yarn-client", but I'm getting the error below. > If I start Spark just with "spark-shell" everything works fine. > I have a single-node machine where I have all Hadoop processes running, and a > Hive metastore server running. > I already tried more than 30 different configurations, but nothing is working. > The config that I have now is this: > core-site.xml: > <configuration> > <property> > <name>fs.defaultFS</name> > <value>hdfs://masternode:9000</value> > </property> > </configuration> > hdfs-site.xml: > <configuration> > <property> > <name>dfs.replication</name> > <value>1</value> > </property> > </configuration> > yarn-site.xml: > <configuration> > <property> > <name>yarn.resourcemanager.resource-tracker.address</name> > <value>masternode:8031</value> > </property> > <property> > <name>yarn.resourcemanager.address</name> > <value>masternode:8032</value> > </property> > <property> > <name>yarn.resourcemanager.scheduler.address</name> > <value>masternode:8030</value> > </property> > <property> > <name>yarn.resourcemanager.admin.address</name> > <value>masternode:8033</value> > </property> > <property> > <name>yarn.resourcemanager.webapp.address</name> > <value>masternode:8088</value> > </property> > </configuration> > About Spark confs: > spark-env.sh: > HADOOP_CONF_DIR=/usr/local/hadoop-2.7.1/hadoop > SPARK_MASTER_IP=masternode > spark-defaults.conf: > spark.master spark://masternode:7077 > spark.serializer org.apache.spark.serializer.KryoSerializer > Do you understand why this is happening? > hadoopadmin@mn:~$ spark-shell --master yarn-client > 16/05/14 23:21:07 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform...
using builtin-java classes where applicable > 16/05/14 23:21:07 INFO spark.SecurityManager: Changing view acls to: > hadoopadmin > 16/05/14 23:21:07 INFO spark.SecurityManager: Changing modify acls to: > hadoopadmin > 16/05/14 23:21:07 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(hadoopadmin); > users with modify permissions: Set(hadoopadmin) > 16/05/14 23:21:08 INFO spark.HttpServer: Starting HTTP Server > 16/05/14 23:21:08 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/05/14 23:21:08 INFO server.AbstractConnector: Started > SocketConnector@0.0.0.0:36979 > 16/05/14 23:21:08 INFO util.Utils: Successfully started service 'HTTP class > server' on port 36979. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77) > Type in expressions to have them evaluated. > Type :help for more information. > 16/05/14 23:21:12 INFO spark.SparkContext: Running Spark version 1.6.1 > 16/05/14 23:21:12 INFO spark.SecurityManager: Changing view acls to: > hadoopadmin > 16/05/14 23:21:12 INFO spark.SecurityManager: Changing modify acls to: > hadoopadmin > 16/05/14 23:21:12 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(hadoopadmin); > users with modify permissions: Set(hadoopadmin) > 16/05/14 23:21:12 INFO util.Utils: Successfully started service 'sparkDriver' > on port 33128. > 16/05/14 23:21:13 INFO slf4j.Slf4jLogger: Slf4jLogger started > 16/05/14 23:21:13 INFO Remoting: Starting remoting > 16/05/14 23:21:13 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.15.0.11:34382] > 16/05/14 23:21:13 INFO util.Utils: Successfully started service > 'sparkDriverActorSystem' on port 34382. 
> 16/05/14 23:21:13 INFO spark.SparkEnv: Registering MapOutputTracker > 16/05/14 23:21:13 INFO spark.SparkEnv: Registering BlockManagerMaster > 16/05/14 23:21:13 INFO storage.DiskBlockManager: Created local directory at > /tmp/blockmgr-a0048199-bf2f-404b-9cd2-b5988367783f > 16/05/14 23:21:13 INFO storage.MemoryStore: MemoryStore started with capacity > 511.1 MB > 16/05/14 23:21:13 INFO spark.SparkEnv: Registering OutputCommitCoordinator > 16/05/14 23:21:13 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/05/14 23:21:13 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:4040 > 16/05/14 23:21:13 INFO util.Utils: Successfully started service 'SparkUI' on > port 4040. > 16/05/14 23:21:13 INFO ui.SparkUI: Started SparkUI at http://10.15.0.11:4040 > 16/05/14 23:21:14 INFO client.RMProxy: Connecting to ResourceManager at > localhost/127.0.0.1:8032 > 16/05/14 23:21:14 INFO yarn.Client: Requesting a new application from cluster > with 1 NodeManagers > 16/05/14 23:21:14 INFO yarn.Client: Verifying our application has not > requested
[jira] [Commented] (SPARK-15403) LinearRegressionWithSGD fails on files more than 12Mb data
[ https://issues.apache.org/jira/browse/SPARK-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292500#comment-15292500 ] Sean Owen commented on SPARK-15403: --- This looks like some problem in your Spark environment, like mismatched or conflicting builds of Spark. ClassNotFoundException wouldn't be something to do with data. > LinearRegressionWithSGD fails on files more than 12Mb data > --- > > Key: SPARK-15403 > URL: https://issues.apache.org/jira/browse/SPARK-15403 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04 with 8 Gb Ram, scala 2.11.7 with following > memory settings for my project: JAVA_OPTS="-Xmx8G -Xms2G" . >Reporter: Ana La >Priority: Blocker > > I parse my json-like data using the DataFrame and SparkSql facilities, and > then scale one numerical feature and create dummy variables for the categorical > features. So far, from the initial 14 keys of my json-like file I get about > 200-240 features in the final LabeledPoint. The final data is sparse and > every file contains at minimum 5 of observations. I try to run two types > of algorithms on the data: LinearRegressionWithSGD or LassoWithSGD, since the > data is sparse and regularization might be required. For data larger than > 11MB, LinearRegressionWithSGD fails with the following error: > {quote} org.apache.spark.SparkException: Job aborted due to stage failure: > Task 58 in stage 346.0 failed 1 times, most recent failure: Lost task 58.0 in > stage 346.0 (TID 18140, localhost): ExecutorLostFailure (executor driver > exited caused by one of the running tasks) Reason: Executor heartbeat timed > out after 179307 ms {quote} > I tried to reproduce this bug with a smaller example, and I suppose that > something could be wrong with LinearRegressionWithSGD on large sets of data. 
> I noticed that in the StandardScaler preprocessing step and in the counts of the > Linear Regression step, a collect() is performed, which can cause the bug. > So the ability to scale linear regression is in question because, as far > as I understand it, collect() runs on the driver, and so the point of > distributed calculation is lost. > {code:scala}
> import java.io.File
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.SQLContext
> import scala.language.postfixOps
>
> object Main2 {
>   def main(args: Array[String]): Unit = {
>     // Spark configuration is defined for execution on a local computer, 4 cores, 8 Gb RAM
>     val conf = new SparkConf()
>       .setMaster(s"local[*]")
>       .setAppName("spark_linear_regression_bug_report")
>       // multiple configurations were tried for driver/executor memories,
>       // including default configurations
>       .set("spark.driver.memory", "3g")
>       .set("spark.executor.memory", "3g")
>       .set("spark.executor.heartbeatInterval", "30s")
>     // Spark context and SQL context definitions
>     val sc = new SparkContext(conf)
>     val sqlContext = new SQLContext(sc)
>     val countFeatures = 500
>     val countList = 50
>     val features = sc.broadcast(1 to countFeatures)
>     val rdd: RDD[LabeledPoint] = sc.range(1, countList).map { i =>
>       LabeledPoint(
>         label = i.toDouble,
>         features = Vectors.dense(features.value.map(_ =>
>           scala.util.Random.nextInt(2).toDouble).toArray)
>       )
>     }.persist()
>     val numIterations = 1000
>     val stepSize = 0.3
>     val algorithm = new LassoWithSGD() // LassoWithSGD()
>     algorithm.setIntercept(true)
>     algorithm.optimizer
>       .setNumIterations(numIterations)
>       .setStepSize(stepSize)
>     val model = algorithm.run(rdd)
>   }
> }
> {code}
> the complete error of the bug: > {quote} [info] Running Main > WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop > library 
for your platform... using builtin-java classes where applicable > WARN org.apache.spark.util.Utils - Your hostname, julien-ubuntu resolves to > a loopback address: 127.0.1.1; using 192.168.0.49 instead (on interface wlan0) > WARN org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to > another address > INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started > INFO Remoting - Starting remoting > INFO Remoting - Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@192.168.0.49:59897] > INFO org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT > INFO org.spark-project.jetty.server.AbstractConnector - Started > SelectChannelConnector@0.0.0.0:4040 > WARN com.github.fommil.netlib.BLAS - Failed to load
[jira] [Commented] (SPARK-14989) Upgrade to Jackson 2.7.3
[ https://issues.apache.org/jira/browse/SPARK-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292497#comment-15292497 ] Sean Owen commented on SPARK-14989: --- See https://github.com/apache/spark/pull/9759 by the way. I think the problem is using something from 2.7 but then executing in an environment that provides 2.4 or 2.5. > Upgrade to Jackson 2.7.3 > > > Key: SPARK-14989 > URL: https://issues.apache.org/jira/browse/SPARK-14989 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > For Spark 2.0, we should upgrade to a newer version of Jackson (2.7.3). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15404) pyspark sql bug ,here is the testcase
[ https://issues.apache.org/jira/browse/SPARK-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15404. --- Resolution: Invalid Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > pyspark sql bug ,here is the testcase > - > > Key: SPARK-15404 > URL: https://issues.apache.org/jira/browse/SPARK-15404 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: 郭同 > >
> import os
> import sys
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
>
> if __name__ == "__main__":
>     sc = SparkContext(appName="PythonSQL")
>     sqlContext = SQLContext(sc)
>     schema = StructType([StructField("person_name", StringType(), False),
>                          StructField("person_age", IntegerType(), False)])
>     some_rdd = sc.parallelize([Row(person_name="John", person_age=19),
>                                Row(person_name="Smith", person_age=23),
>                                Row(person_name="Sarah", person_age=18)])
>     some_df = sqlContext.createDataFrame(some_rdd, schema)
>     some_df.printSchema()
>     some_df.registerAsTable("people")
>     teenagers = sqlContext.sql("SELECT * FROM people ")
>     for each in teenagers.collect():
>         print(each)
>     sc.stop()
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15423) why it is very slow to clean resources in Spark
[ https://issues.apache.org/jira/browse/SPARK-15423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15423. --- Resolution: Invalid Fix Version/s: (was: 2.0.0) Target Version/s: (was: 2.0.0) Questions go to user@. Right now there's no sign it's a Spark problem per se. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first > why it is very slow to clean resources in Spark > --- > > Key: SPARK-15423 > URL: https://issues.apache.org/jira/browse/SPARK-15423 > Project: Spark > Issue Type: Question > Components: Block Manager, MLlib >Affects Versions: 2.0.0 > Environment: RedHat 6.5 (64 bit), JDK 1.8, Standalone mode >Reporter: zszhong > Labels: newbie, starter > > Hi, everyone! I'm new to Spark. Originally I submitted a post at > [http://stackoverflow.com/questions/37331226/why-it-is-very-slow-to-clean-resources-in-spark], > but somebody thought that it was off-topic. Thus I post here to ask for your > help. If this post does not belong here, please feel free to delete it. I just > copied the content here; I don't know how to edit the code to be more readable, > so please refer to the link on Stack Overflow. 
> I've submitted a very simple task into a standalone Spark environment > (`spark-2.0.0-preview`, `jdk 1.8`, `48 CPU cores`, `250 Gb memory`) with the > following command: > bin/spark-submit.sh --master spark://hostname.domain:7077 --conf > "spark.executor.memory=8G" ../SimpleApp.py ../data/train/ ../data/val/ > where the `SimpleApp.py` is:
> from __future__ import print_function
> import sys
> import os
> import numpy as np
> from pyspark import SparkContext
> from pyspark.mllib.tree import RandomForest, RandomForestModel
> from pyspark.mllib.util import MLUtils
>
> trainDataPath = sys.argv[1]
> valDataPath = sys.argv[2]
> sc = SparkContext(appName="Classification using Spark Random Forest")
> trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
> valData = MLUtils.loadLibSVMFile(sc, valDataPath)
> model = RandomForest.trainClassifier(trainData, numClasses=6,
>     categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto",
>     impurity='gini', maxDepth=4, maxBins=32)
> predictions = model.predict(valData.map(lambda x: x.features))
> labelsAndPredictions = valData.map(lambda lp: lp.label).zip(predictions)
> testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(valData.count())
> print('Test Error = ' + str(testErr))
> And the task is running OK and can output the `Test Error` as follows: > Test Error = 0.380580779161 > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on > 127.0.0.1:59714 in memory (size: 12.1 KB, free: 511.1 MB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on > 127.0.0.1:37978 in memory (size: 12.1 KB, free: 4.5 GB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on > 127.0.0.1:37978 in memory (size: 10.9 KB, free: 4.5 GB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on > 127.0.0.1:59714 in memory (size: 10.9 KB, free: 511.1 MB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on > 127.0.0.1:59714 in 
memory (size: 4.6 KB, free: 511.1 MB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on > 127.0.0.1:37978 in memory (size: 4.6 KB, free: 4.5 GB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on > 127.0.0.1:59714 in memory (size: 4.0 KB, free: 511.1 MB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on > 127.0.0.1:37978 in memory (size: 4.0 KB, free: 4.5 GB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on > 127.0.0.1:59714 in memory (size: 455.0 B, free: 511.1 MB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on > 127.0.0.1:37978 in memory (size: 455.0 B, free: 4.5 GB) > 16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 4 > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on > 127.0.0.1:59714 in memory (size: 9.2 KB, free: 511.1 MB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on > 127.0.0.1:37978 in memory (size: 9.2 KB, free: 4.5 GB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on > 127.0.0.1:59714 in memory (size: 3.6 KB, free: 511.1 MB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on > 127.0.0.1:37978 in memory (size: 3.6 KB, free: 4.5 GB) > 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_15_piece0 on > 127.0.0.1:59714 in memory (size: 389.0 B, free: 511.1 MB) > 16/05/20
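One portability note on the quoted `SimpleApp.py`: `lambda (v, p): v != p` relies on tuple-parameter unpacking, which only parses on Python 2 (it was removed in Python 3 by PEP 3113). A minimal stand-alone sketch of the Python 3-compatible form; the list below is a hypothetical stand-in for the zipped label/prediction RDD:

```python
# Stand-in for valData.map(lambda lp: lp.label).zip(predictions):
# a list of (label, prediction) pairs.
pairs = [(1.0, 1.0), (2.0, 3.0), (4.0, 4.0)]

# Python 2 only:  filter(lambda (v, p): v != p, pairs)
# Python 3: take a single argument and index (or unpack in the body).
errors = [vp for vp in pairs if vp[0] != vp[1]]
test_err = len(errors) / float(len(pairs))

assert errors == [(2.0, 3.0)]      # one mismatched pair
assert abs(test_err - 1 / 3.0) < 1e-9
```

The same single-argument lambda works unchanged when passed to an RDD `filter`, so the snippet in the report would run on either Python version after this rewrite.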
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292488#comment-15292488 ] Mete Kural commented on SPARK-3821: --- Thank you for the response, Nicholas. spark-ec2 does take care of AMIs for EC2, and in fact it is documented in the Spark documentation as a deployment method and is distributed with Spark. However, the same level of support doesn't seem to exist for Docker as a deployment method. What's inside the docker folder in Spark is not really in shape for a production deployment, is not documented in the Spark documentation either, and doesn't seem to have been worked on in quite a while. It seems the only way the Spark project officially supports running Spark on Docker is via Mesos; would you say that is correct? With Docker becoming an industry standard as of a month ago, I hope there will be renewed interest within the Spark project in supporting Docker as an official deployment method without the Mesos requirement. > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15341) Add documentation for `model.write` to clarify `summary` was not saved
[ https://issues.apache.org/jira/browse/SPARK-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15341: -- Assignee: Yanbo Liang > Add documentation for `model.write` to clarify `summary` was not saved > --- > > Key: SPARK-15341 > URL: https://issues.apache.org/jira/browse/SPARK-15341 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Currently in model.write, we don't save summary(if applicable). We should add > documentation to clarify it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15341) Add documentation for `model.write` to clarify `summary` was not saved
[ https://issues.apache.org/jira/browse/SPARK-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15341: -- Fix Version/s: 2.0.0 > Add documentation for `model.write` to clarify `summary` was not saved > --- > > Key: SPARK-15341 > URL: https://issues.apache.org/jira/browse/SPARK-15341 > Project: Spark > Issue Type: Documentation > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.0.0 > > > Currently in model.write, we don't save summary(if applicable). We should add > documentation to clarify it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15375) Add ConsoleSink for structure streaming to display the dataframe on the fly
[ https://issues.apache.org/jira/browse/SPARK-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15375. - Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.0.0 > Add ConsoleSink for structure streaming to display the dataframe on the fly > --- > > Key: SPARK-15375 > URL: https://issues.apache.org/jira/browse/SPARK-15375 > Project: Spark > Issue Type: New Feature > Components: SQL, Streaming >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.0.0 > > > Add a ConsoleSink for structure streaming, user could specify like other sink > and display on the console. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
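Conceptually, the sink described above just prints each streaming micro-batch as it arrives. Spark's real implementation is Scala code inside the SQL engine; the toy Python class below only illustrates the idea, and none of these names are Spark API:

```python
class ToyConsoleSink:
    """Illustrative only: 'display the dataframe on the fly' by printing
    each micro-batch, truncated to a maximum number of rows."""

    def __init__(self, num_rows=20):
        self.num_rows = num_rows  # mirrors a typical row-truncation option

    def add_batch(self, batch_id, rows):
        # Called once per micro-batch with that batch's rows.
        print("-------- Batch: {} --------".format(batch_id))
        for row in rows[: self.num_rows]:
            print(row)

sink = ToyConsoleSink(num_rows=2)
sink.add_batch(0, [("a", 1), ("b", 2), ("c", 3)])  # only the first 2 rows print
```

In the real feature, a user would select this sink like any other output format when starting a structured streaming query, and each micro-batch would be rendered to the console.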
[jira] [Updated] (SPARK-15414) Make the mllib,ml linalg type conversion APIs public
[ https://issues.apache.org/jira/browse/SPARK-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15414: -- Assignee: Sandeep Singh > Make the mllib,ml linalg type conversion APIs public > > > Key: SPARK-15414 > URL: https://issues.apache.org/jira/browse/SPARK-15414 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Sandeep Singh > Fix For: 2.0.0 > > > We should open up the APIs for converting between new, old linear algebra > types (in spark.mllib.linalg): > * Vector.asML > * Vectors.fromML > * same for Sparse/Dense and for Matrices > I made these private originally, but they will be useful for users > transitioning workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15414) Make the mllib,ml linalg type conversion APIs public
[ https://issues.apache.org/jira/browse/SPARK-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15414. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13202 [https://github.com/apache/spark/pull/13202] > Make the mllib,ml linalg type conversion APIs public > > > Key: SPARK-15414 > URL: https://issues.apache.org/jira/browse/SPARK-15414 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Joseph K. Bradley > Fix For: 2.0.0 > > > We should open up the APIs for converting between new, old linear algebra > types (in spark.mllib.linalg): > * Vector.asML > * Vectors.fromML > * same for Sparse/Dense and for Matrices > I made these private originally, but they will be useful for users > transitioning workloads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
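The APIs opened up here convert between the old `spark.mllib.linalg` and new `spark.ml.linalg` vector types (`asML` one way, `Vectors.fromML` the other). Those are Scala methods; the plain-Python classes below are hypothetical stand-ins that only sketch the round-trip contract such converters must satisfy:

```python
class OldDenseVector:
    """Hypothetical stand-in for the legacy spark.mllib.linalg dense vector."""
    def __init__(self, values):
        self.values = list(values)

    def as_ml(self):
        # old -> new: same data, new type (mirrors Vector.asML in spirit).
        return NewDenseVector(self.values)


class NewDenseVector:
    """Hypothetical stand-in for the new spark.ml.linalg dense vector."""
    def __init__(self, values):
        self.values = list(values)


def from_ml(v):
    # new -> old (mirrors Vectors.fromML in spirit).
    return OldDenseVector(v.values)


# The conversion must be lossless: converting over and back preserves the
# data, which is what lets users transition workloads between the two APIs.
old = OldDenseVector([1.0, 2.0, 3.0])
assert from_ml(old.as_ml()).values == old.values
```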
[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292403#comment-15292403 ] Xiangrui Meng commented on SPARK-15363: --- No. I think we need to make the converters between new and old vectors public (WIP), and then the example code won't need the implicits. Another option is to make the implicits public. > Example code shouldn't use VectorImplicits._, asML/fromML > - > > Key: SPARK-15363 > URL: https://issues.apache.org/jira/browse/SPARK-15363 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Xiangrui Meng > > In SPARK-14615, we use VectorImplicits._ and asML in example code to > minimize the changes in that PR. However, this is a private API, which > shouldn't appear in the example code. We should consider updating them during > QA. > https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15424) Revert SPARK-14807 Create a hivecontext-compatibility module
[ https://issues.apache.org/jira/browse/SPARK-15424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292381#comment-15292381 ] Apache Spark commented on SPARK-15424: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13207 > Revert SPARK-14807 Create a hivecontext-compatibility module > > > Key: SPARK-15424 > URL: https://issues.apache.org/jira/browse/SPARK-15424 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > I initially asked to create a hivecontext-compatibility module to put the > HiveContext there. But we are so close to Spark 2.0 release and there is only > a single class in it. It seems overkill to have an entire package, which > makes it more inconvenient, for a single class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15424) Revert SPARK-14807 Create a hivecontext-compatibility module
[ https://issues.apache.org/jira/browse/SPARK-15424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15424: Assignee: Reynold Xin (was: Apache Spark) > Revert SPARK-14807 Create a hivecontext-compatibility module > > > Key: SPARK-15424 > URL: https://issues.apache.org/jira/browse/SPARK-15424 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > I initially asked to create a hivecontext-compatibility module to put the > HiveContext there. But we are so close to Spark 2.0 release and there is only > a single class in it. It seems overkill to have an entire package, which > makes it more inconvenient, for a single class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14807) Create a hivecontext-compatibility module
[ https://issues.apache.org/jira/browse/SPARK-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292384#comment-15292384 ] Apache Spark commented on SPARK-14807: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13207 > Create a hivecontext-compatibility module > - > > Key: SPARK-14807 > URL: https://issues.apache.org/jira/browse/SPARK-14807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > In 2.0, SparkSession will replace SQLContext/HiveContext. We will move > HiveContext to a compatibility module and users can optionally use this > module to access HiveContext. > This jira is to create this compatibility module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15424) Revert SPARK-14807 Create a hivecontext-compatibility module
[ https://issues.apache.org/jira/browse/SPARK-15424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15424: Assignee: Apache Spark (was: Reynold Xin) > Revert SPARK-14807 Create a hivecontext-compatibility module > > > Key: SPARK-15424 > URL: https://issues.apache.org/jira/browse/SPARK-15424 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Fix For: 2.0.0 > > > I initially asked to create a hivecontext-compatibility module to put the > HiveContext there. But we are so close to Spark 2.0 release and there is only > a single class in it. It seems overkill to have an entire package, which > makes it more inconvenient, for a single class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15424) Revert SPARK-14807 Create a hivecontext-compatibility module
Reynold Xin created SPARK-15424: --- Summary: Revert SPARK-14807 Create a hivecontext-compatibility module Key: SPARK-15424 URL: https://issues.apache.org/jira/browse/SPARK-15424 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin I initially asked to create a hivecontext-compatibility module to put the HiveContext there. But we are so close to Spark 2.0 release and there is only a single class in it. It seems overkill to have an entire package, which makes it more inconvenient, for a single class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14807) Create a hivecontext-compatibility module
[ https://issues.apache.org/jira/browse/SPARK-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14807: Summary: Create a hivecontext-compatibility module (was: Create a compatibility module) > Create a hivecontext-compatibility module > - > > Key: SPARK-14807 > URL: https://issues.apache.org/jira/browse/SPARK-14807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > In 2.0, SparkSession will replace SQLContext/HiveContext. We will move > HiveContext to a compatibility module and users can optionally use this > module to access HiveContext. > This jira is to create this compatibility module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15423) why it is very slow to clean resources in Spark
zszhong created SPARK-15423: --- Summary: why it is very slow to clean resources in Spark Key: SPARK-15423 URL: https://issues.apache.org/jira/browse/SPARK-15423 Project: Spark Issue Type: Question Components: Block Manager, MLlib Affects Versions: 2.0.0 Environment: RedHat 6.5 (64 bit), JDK 1.8, Standalone mode Reporter: zszhong Fix For: 2.0.0 Hi, everyone! I'm new to Spark. Originally I submitted a post at [http://stackoverflow.com/questions/37331226/why-it-is-very-slow-to-clean-resources-in-spark], but somebody thought that it was off-topic. Thus I post here to ask for your help. If this post does not belong here, please feel free to delete it. I just copied the content here; I don't know how to edit the code to be more readable, so please refer to the link on Stack Overflow. I've submitted a very simple task into a standalone Spark environment (`spark-2.0.0-preview`, `jdk 1.8`, `48 CPU cores`, `250 Gb memory`) with the following command: bin/spark-submit.sh --master spark://hostname.domain:7077 --conf "spark.executor.memory=8G" ../SimpleApp.py ../data/train/ ../data/val/ where the `SimpleApp.py` is:
from __future__ import print_function
import sys
import os
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

trainDataPath = sys.argv[1]
valDataPath = sys.argv[2]
sc = SparkContext(appName="Classification using Spark Random Forest")
trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
valData = MLUtils.loadLibSVMFile(sc, valDataPath)
model = RandomForest.trainClassifier(trainData, numClasses=6,
    categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto",
    impurity='gini', maxDepth=4, maxBins=32)
predictions = model.predict(valData.map(lambda x: x.features))
labelsAndPredictions = valData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(valData.count())
print('Test Error = ' + str(testErr))
And 
the task is running OK and can output the `Test Error` as follows:
{noformat}
Test Error = 0.380580779161
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 127.0.0.1:59714 in memory (size: 12.1 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 127.0.0.1:37978 in memory (size: 12.1 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 127.0.0.1:37978 in memory (size: 10.9 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 127.0.0.1:59714 in memory (size: 10.9 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 127.0.0.1:59714 in memory (size: 4.6 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 127.0.0.1:37978 in memory (size: 4.6 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 127.0.0.1:59714 in memory (size: 4.0 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 127.0.0.1:37978 in memory (size: 4.0 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 127.0.0.1:59714 in memory (size: 455.0 B, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 127.0.0.1:37978 in memory (size: 455.0 B, free: 4.5 GB)
16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 4
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 127.0.0.1:59714 in memory (size: 9.2 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 127.0.0.1:37978 in memory (size: 9.2 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 127.0.0.1:59714 in memory (size: 3.6 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 127.0.0.1:37978 in memory (size: 3.6 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 127.0.0.1:59714 in memory (size: 389.0 B, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 127.0.0.1:37978 in memory (size: 389.0 B, free: 4.5 GB)
16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 3
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_12_piece0 on 127.0.0.1:59714 in memory (size: 345.0 B, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_12_piece0 on 127.0.0.1:37978 in memory (size: 345.0 B, free: 4.5 GB)
16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 2
16/05/20 01:04:52 INFO BlockManager: Removing RDD 19
16/05/20 01:04:52 INFO ContextCleaner:
{noformat}
[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML
[ https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292356#comment-15292356 ] Miao Wang commented on SPARK-15363: --- [~mengxr] I want to solve this problem by copying MultivariateOnlineSummarizer.scala to mllib-local using ml.vectors, so in the example we don't have to import VectorImplicits._. Do you think this is the fix you want? Thanks! > Example code shouldn't use VectorImplicits._, asML/fromML > - > > Key: SPARK-15363 > URL: https://issues.apache.org/jira/browse/SPARK-15363 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Xiangrui Meng > > In SPARK-14615, we use VectorImplicits._ and asML in example code to > minimize the changes in that PR. However, this is a private API, which > shouldn't appear in the example code. We should consider updating them during > QA. > https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
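For context, the MultivariateOnlineSummarizer discussed above computes per-dimension statistics in a single streaming pass. As a rough illustration of the underlying technique (not Spark's actual implementation), here is a single-variable online mean/variance update in the Welford style; the class and method names are hypothetical:

```python
# Hypothetical single-variable sketch of the kind of one-pass update
# MultivariateOnlineSummarizer performs per vector dimension
# (Welford's online algorithm); class/method names are illustrative.
class OnlineSummarizer:
    def __init__(self):
        self.n = 0        # number of samples seen
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Unbiased sample variance, computed without storing the data.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

s = OnlineSummarizer()
for x in [1.0, 2.0, 3.0, 4.0]:
    s.add(x)
print(s.mean)        # 2.5
print(s.variance())  # 1.666...
```

Because the update needs only the running state, the real class can merge two summarizers built on different partitions, which is what makes it usable over distributed ml.linalg vectors.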
[jira] [Closed] (SPARK-15419) monotonicallyIncreasingId should use less memory with multiple partitions
[ https://issues.apache.org/jira/browse/SPARK-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-15419. - Resolution: Duplicate Assignee: Shixiong Zhu Fix Version/s: 2.0.0 Confirmed that the patch for [SPARK-15317] fixes this issue, or at least makes it much less significant. > monotonicallyIncreasingId should use less memory with multiple partitions > - > > Key: SPARK-15419 > URL: https://issues.apache.org/jira/browse/SPARK-15419 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: branch-2.0, 1 worker >Reporter: Joseph K. Bradley >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > When monotonicallyIncreasingId is used on a DataFrame with many partitions, > it uses a very large amount of memory. > Consider this code: > {code} > import org.apache.spark.sql.functions._ > // JMAP1: run jmap -histo:live [PID] > val numPartitions = 1000 > val df = spark.range(0, 100, 1, numPartitions).toDF("vtx") > df.cache().count() > // JMAP2 > val df2 = df.withColumn("id", monotonicallyIncreasingId()) > df2.cache().count() > // JMAP3 > df2.select(col("id") + 1).count() > // JMAP4 > {code} > Here's how memory usage progresses: > * JMAP1: This is just for calibration. > * JMAP2: No significant change from 1. > * JMAP3: Massive jump: 3048895 Longs, 1039638 Objects, 2007427 Integers, > 1002000 org.apache.spark.sql.catalyst.expressions.GenericInternalRow > ** None of these had significant numbers of instances in JMAP1/2. > * JMAP4: This doubles the object creation. I.e., even after caching, it > keeps generating new objects on every use. > When the indexed DataFrame is used repeatedly afterwards, the driver memory > usage keeps increasing and eventually blows up in my application. > I wrote "with multiple partitions" because this issue goes away when > numPartitions is small (1 or 2). > Presumably this memory usage could be reduced. 
> Note: I also tested a custom indexing using RDD.zipWithIndex, and it is even > worse in terms of object creation (about 2x worse): > {code} > import org.apache.spark.sql.{DataFrame, Row} > import org.apache.spark.sql.types.{DataTypes, StructField, StructType} > def zipWithUniqueIdFrom0(df: DataFrame): DataFrame = { > val sqlContext = df.sqlContext > val schema = df.schema > val outputSchema = StructType(Seq( > StructField("row", schema, false), StructField("id", > DataTypes.IntegerType, false))) > val rdd = df.rdd.zipWithIndex().map { case (row: Row, id: Long) => Row(row, > id.toInt) } > sqlContext.createDataFrame(rdd, outputSchema) > } > // val df2 = df.withColumn("id", monotonicallyIncreasingId()) > val df2 = zipWithUniqueIdFrom0(df) > df2.cache().count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
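For context on why the IDs generated above are unique but not consecutive: monotonicallyIncreasingId documents an ID layout that packs the partition index into the upper 31 bits of a 64-bit long and a per-partition record counter into the lower 33 bits. A minimal sketch of that layout (illustrative only; Spark builds these IDs inside the executors, and the helper name here is hypothetical):

```python
# Sketch of the documented ID layout of monotonicallyIncreasingId:
# upper 31 bits = partition index, lower 33 bits = record number
# within the partition. 'monotonic_id' is a hypothetical helper, not
# how Spark actually generates the values.
RECORD_BITS = 33

def monotonic_id(partition_index, record_offset):
    # Each partition can hold at most 2**33 records under this scheme.
    assert 0 <= record_offset < (1 << RECORD_BITS)
    return (partition_index << RECORD_BITS) | record_offset

print(monotonic_id(0, 0))  # first record of partition 0
print(monotonic_id(1, 0))  # first record of partition 1 (2**33)
print(monotonic_id(1, 5))  # sixth record of partition 1
```

IDs increase within and across partitions but leave large gaps between partitions, which is why many partitions yield widely spread values rather than a dense 0..n-1 range.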
[jira] [Commented] (SPARK-14734) Add conversions between mllib and ml Vector, Matrix types
[ https://issues.apache.org/jira/browse/SPARK-14734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292340#comment-15292340 ] DB Tsai commented on SPARK-14734: - We will deprecate the mllib methods in the following releases. As a result, we would like to have a clear cut between two APIs. I agree that it's debatable to make it public or not. My concern is once they're public, it's hard to go back. > Add conversions between mllib and ml Vector, Matrix types > - > > Key: SPARK-14734 > URL: https://issues.apache.org/jira/browse/SPARK-14734 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 2.0.0 > > > For maintaining wrappers around spark.mllib algorithms in spark.ml, it will > be useful to have {{private[spark]}} methods for converting from one linear > algebra representation to another. I am running into this issue in > [SPARK-14732]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15420) Repartition and sort before Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292338#comment-15292338 ] Apache Spark commented on SPARK-15420: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/13206 > Repartition and sort before Parquet writes > -- > > Key: SPARK-15420 > URL: https://issues.apache.org/jira/browse/SPARK-15420 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ryan Blue > > Parquet requires buffering data in memory before writing a group of rows > organized by column. This causes significant memory pressure when writing > partitioned output because each open file must buffer rows. > Currently, Spark will sort data and spill if necessary in the > {{WriterContainer}} to avoid keeping many files open at once. But, this isn't > a full solution for a few reasons: > * The final sort is always performed, even if incoming data is already sorted > correctly. For example, a global sort will cause two sorts to happen, even if > the global sort correctly prepares the data. > * To prevent a large number of small output files, users must manually > add a repartition step. That step is also ignored by the sort within the > writer. > * Hive does not currently support {{DataFrameWriter#sortBy}} > The sort in {{WriterContainer}} makes sense to prevent problems, but should > detect if the incoming data is already sorted. The {{DataFrameWriter}} should > also expose the ability to repartition data before the write stage, and the > query planner should expose an option to automatically insert repartition > operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
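The first bullet in the description suggests the writer should detect already-sorted input and skip its own sort. A minimal sketch of that check, with hypothetical names and a plain in-memory row list standing in for the writer's row stream:

```python
# Hypothetical sketch of the proposed fix: before the writer sorts rows
# by partition key, check whether the input is already ordered and skip
# the redundant O(n log n) sort. Names and the in-memory rows are
# illustrative stand-ins, not Spark's WriterContainer API.
def is_sorted(rows, key):
    # Single linear pass: cheaper than re-sorting sorted data.
    return all(key(rows[i]) <= key(rows[i + 1]) for i in range(len(rows) - 1))

def prepare_for_write(rows, key):
    if is_sorted(rows, key):
        return rows  # e.g. a preceding global sort already ordered the data
    return sorted(rows, key=key)  # fall back to the writer's own sort

# Already sorted by partition value, so no second sort is performed:
rows = [("2016-05-19", "a"), ("2016-05-20", "b")]
assert prepare_for_write(rows, key=lambda r: r[0]) is rows
```

In a real engine this would be a plan-level property (the output ordering of the child operator) rather than a data scan, but the effect is the same: a correctly prepared global sort would no longer trigger a second sort in the writer.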
[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292339#comment-15292339 ] Apache Spark commented on SPARK-15345: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13200 > SparkSession's conf doesn't take effect when there's already an existing > SparkContext > - > > Key: SPARK-15345 > URL: https://issues.apache.org/jira/browse/SPARK-15345 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Piotr Milanowski >Priority: Blocker > > I am working with branch-2.0, spark is compiled with hive support (-Phive and > -Phive-thriftserver). > I am trying to access databases using this snippet: > {code} > from pyspark.sql import HiveContext > hc = HiveContext(sc) > hc.sql("show databases").collect() > [Row(result='default')] > {code} > This means that spark doesn't find any databases specified in configuration. > Using the same configuration (i.e. hive-site.xml and core-site.xml) in spark > 1.6, and launching above snippet, I can print out existing databases. 
> When run in DEBUG mode this is what spark (2.0) prints out: > {code} > 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases > 16/05/16 12:17:47 DEBUG SimpleAnalyzer: > === Result of Batch Resolution === > !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, > string])) null else input[0, string].toString, > StructField(result,StringType,false)), result#2) AS #3] Project > [createexternalrow(if (isnull(result#2)) null else result#2.toString, > StructField(result,StringType,false)) AS #3] > +- LocalRelation [result#2] > > +- LocalRelation [result#2] > > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.Dataset$$anonfun$53) +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: private final > org.apache.spark.sql.types.StructType > org.apache.spark.sql.Dataset$$anonfun$53.structType$1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure > (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure > (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) > +++ > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 1 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID > 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object) > 16/05/16 12:17:47 DEBUG ClosureCleaner: public final > org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler > org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator) > 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because > this is the starting closure > 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting > closure: 0 > 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! > 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure >
[jira] [Assigned] (SPARK-15420) Repartition and sort before Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15420: Assignee: (was: Apache Spark) > Repartition and sort before Parquet writes > -- > > Key: SPARK-15420 > URL: https://issues.apache.org/jira/browse/SPARK-15420 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ryan Blue > > Parquet requires buffering data in memory before writing a group of rows > organized by column. This causes significant memory pressure when writing > partitioned output because each open file must buffer rows. > Currently, Spark will sort data and spill if necessary in the > {{WriterContainer}} to avoid keeping many files open at once. But, this isn't > a full solution for a few reasons: > * The final sort is always performed, even if incoming data is already sorted > correctly. For example, a global sort will cause two sorts to happen, even if > the global sort correctly prepares the data. > * To prevent a large number of small output files, users must manually > add a repartition step. That step is also ignored by the sort within the > writer. > * Hive does not currently support {{DataFrameWriter#sortBy}} > The sort in {{WriterContainer}} makes sense to prevent problems, but should > detect if the incoming data is already sorted. The {{DataFrameWriter}} should > also expose the ability to repartition data before the write stage, and the > query planner should expose an option to automatically insert repartition > operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15420) Repartition and sort before Parquet writes
[ https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15420: Assignee: Apache Spark > Repartition and sort before Parquet writes > -- > > Key: SPARK-15420 > URL: https://issues.apache.org/jira/browse/SPARK-15420 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ryan Blue >Assignee: Apache Spark > > Parquet requires buffering data in memory before writing a group of rows > organized by column. This causes significant memory pressure when writing > partitioned output because each open file must buffer rows. > Currently, Spark will sort data and spill if necessary in the > {{WriterContainer}} to avoid keeping many files open at once. But, this isn't > a full solution for a few reasons: > * The final sort is always performed, even if incoming data is already sorted > correctly. For example, a global sort will cause two sorts to happen, even if > the global sort correctly prepares the data. > * To prevent a large number of small output files, users must manually > add a repartition step. That step is also ignored by the sort within the > writer. > * Hive does not currently support {{DataFrameWriter#sortBy}} > The sort in {{WriterContainer}} makes sense to prevent problems, but should > detect if the incoming data is already sorted. The {{DataFrameWriter}} should > also expose the ability to repartition data before the write stage, and the > query planner should expose an option to automatically insert repartition > operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14734) Add conversions between mllib and ml Vector, Matrix types
[ https://issues.apache.org/jira/browse/SPARK-14734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292335#comment-15292335 ] Koert Kuipers commented on SPARK-14734: --- Why does making them public make it harder for users to migrate their application? Currently you make it harder to upgrade existing applications to Spark 2.0. If you want to move people away from mllib to ml, the way to do so is to deprecate the mllib methods, not to make life hard by leaving out methods, I think? > Add conversions between mllib and ml Vector, Matrix types > - > > Key: SPARK-14734 > URL: https://issues.apache.org/jira/browse/SPARK-14734 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > Fix For: 2.0.0 > > > For maintaining wrappers around spark.mllib algorithms in spark.ml, it will > be useful to have {{private[spark]}} methods for converting from one linear > algebra representation to another. I am running into this issue in > [SPARK-14732]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15422) Remove unnecessary calculation of stage's parents
[ https://issues.apache.org/jira/browse/SPARK-15422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15422: Assignee: Apache Spark > Remove unnecessary calculation of stage's parents > - > > Key: SPARK-15422 > URL: https://issues.apache.org/jira/browse/SPARK-15422 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: sharkd tu >Assignee: Apache Spark > > Remove unnecessary calculation of stage's parents, because stage's parents > have been set at the time of stage construction. > See https://github.com/apache/spark/pull/13123 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15422) Remove unnecessary calculation of stage's parents
[ https://issues.apache.org/jira/browse/SPARK-15422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292331#comment-15292331 ] Apache Spark commented on SPARK-15422: -- User 'sharkdtu' has created a pull request for this issue: https://github.com/apache/spark/pull/13123 > Remove unnecessary calculation of stage's parents > - > > Key: SPARK-15422 > URL: https://issues.apache.org/jira/browse/SPARK-15422 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: sharkd tu > > Remove unnecessary calculation of stage's parents, because stage's parents > have been set at the time of stage construction. > See https://github.com/apache/spark/pull/13123 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15422) Remove unnecessary calculation of stage's parents
[ https://issues.apache.org/jira/browse/SPARK-15422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15422: Assignee: (was: Apache Spark) > Remove unnecessary calculation of stage's parents > - > > Key: SPARK-15422 > URL: https://issues.apache.org/jira/browse/SPARK-15422 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: sharkd tu > > Remove unnecessary calculation of stage's parents, because stage's parents > have been set at the time of stage construction. > See https://github.com/apache/spark/pull/13123 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org