[jira] [Resolved] (SPARK-15313) EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject.

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15313.
-
   Resolution: Fixed
 Assignee: Takuya Ueshin
Fix Version/s: 2.0.0

> EmbedSerializerInFilter rule should keep exprIds of output of surrounded 
> SerializeFromObject.
> -
>
> Key: SPARK-15313
> URL: https://issues.apache.org/jira/browse/SPARK-15313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.0.0
>
>
> The following code:
> {code}
> val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
> ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
> {code}
> throws an Exception:
> {noformat}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _1#420
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:254)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>  at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>  at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>  at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
>  at 
> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>  at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>  at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>  at 
> org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:113)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:79)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>  at 
> 
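For context on the title: the rule needs to re-expose the new serializer output
under the exprIds of the surrounded {{SerializeFromObject}}, so that downstream
references such as {{_1#420}} above still bind. A hedged, illustrative sketch
(not the actual patch) of that idea using Catalyst's {{Alias}}:

{code}
// Illustrative sketch only: alias each new serializer expression back to the
// old output attribute's name and exprId, then wrap it in a Project, so the
// operators above keep resolving against the original exprIds.
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

def keepExprIds(
    newSerializer: Seq[NamedExpression],
    oldOutput: Seq[Attribute],
    child: LogicalPlan): LogicalPlan = {
  val aliased = newSerializer.zip(oldOutput).map { case (expr, attr) =>
    Alias(expr, attr.name)(exprId = attr.exprId)
  }
  Project(aliased, child)
}
{code}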

[jira] [Comment Edited] (SPARK-15393) Writing empty Dataframes doesn't save any _metadata files

2016-05-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292750#comment-15292750
 ] 

Hyukjin Kwon edited comment on SPARK-15393 at 5/20/16 5:51 AM:
---

[~jurriaanpruis]
Hm.. I am trying to reproduce these exceptions.

I added a test in {{ParquetHadoopFsRelationSuite}} as below and ran it
**before/after my PR**:

{code}
  test("SPARK-15393: create empty file") {
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  withTempPath { path =>
val schema = StructType(
  StructField("k", StringType, true) ::
  StructField("v", IntegerType, false) :: Nil)
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
schema)
emptyDf.write
  .format("parquet")
  .save(path.getCanonicalPath)

val copyEmptyDf = spark.read
  .format("parquet")
  .load(path.getCanonicalPath)

copyEmptyDf.show()
  }
}
  }
{code}

I could reproduce the exceptions when reading, but so far I could not reproduce
the exceptions when writing, either before or after the PR. (I ran it more than
10 times both before and after my PR.)

It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be
the cause of these exceptions.

Do you mind sharing the code you ran?


was (Author: hyukjin.kwon):
[~jurriaanpruis]
Hm.. I am trying to reproduce these exceptions.

I added a test in {{ParquetHadoopFsRelationSuite}} as below and ran it
before/after my PR:

{code}
  test("SPARK-15393: create empty file") {
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  withTempPath { path =>
val schema = StructType(
  StructField("k", StringType, true) ::
  StructField("v", IntegerType, false) :: Nil)
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
schema)
emptyDf.write
  .format("parquet")
  .save(path.getCanonicalPath)

val copyEmptyDf = spark.read
  .format("parquet")
  .load(path.getCanonicalPath)

copyEmptyDf.show()
  }
}
  }
{code}

I could reproduce the exceptions when reading, but so far I could not reproduce
the exceptions when writing, either before or after the PR. (I ran it more than
10 times both before and after my PR.)

It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be
the cause of these exceptions.

Do you mind sharing the code you ran?

> Writing empty Dataframes doesn't save any _metadata files
> -
>
> Key: SPARK-15393
> URL: https://issues.apache.org/jira/browse/SPARK-15393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Priority: Critical
>
> Writing empty DataFrames is broken on the latest master.
> It omits the _metadata files and sometimes throws the following exception (when
> saving as Parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: 
> org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary 
> file for file:/some/test/file
> java.lang.NullPointerException
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> 

[jira] [Created] (SPARK-15434) improve EmbedSerializerInFilter rule

2016-05-19 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-15434:
---

 Summary: improve EmbedSerializerInFilter rule
 Key: SPARK-15434
 URL: https://issues.apache.org/jira/browse/SPARK-15434
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 2.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Commented] (SPARK-15393) Writing empty Dataframes doesn't save any _metadata files

2016-05-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292750#comment-15292750
 ] 

Hyukjin Kwon commented on SPARK-15393:
--

Hm.. I am trying to reproduce these exceptions.

I added a test in {{ParquetHadoopFsRelationSuite}} as below and ran it
before/after my PR:

{code}
  test("SPARK-15393: create empty file") {
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  withTempPath { path =>
val schema = StructType(
  StructField("k", StringType, true) ::
  StructField("v", IntegerType, false) :: Nil)
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
schema)
emptyDf.write
  .format("parquet")
  .save(path.getCanonicalPath)

val copyEmptyDf = spark.read
  .format("parquet")
  .load(path.getCanonicalPath)

copyEmptyDf.show()
  }
}
  }
{code}

I could reproduce the exceptions when reading, but so far I could not reproduce
the exceptions when writing, either before or after the PR. (I ran it more than
10 times both before and after my PR.)

It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be
the cause of these exceptions.

Do you mind sharing the code you ran?

> Writing empty Dataframes doesn't save any _metadata files
> -
>
> Key: SPARK-15393
> URL: https://issues.apache.org/jira/browse/SPARK-15393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Priority: Critical
>
> Writing empty DataFrames is broken on the latest master.
> It omits the _metadata files and sometimes throws the following exception (when
> saving as Parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: 
> org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary 
> file for file:/some/test/file
> java.lang.NullPointerException
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 

[jira] [Comment Edited] (SPARK-15393) Writing empty Dataframes doesn't save any _metadata files

2016-05-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292750#comment-15292750
 ] 

Hyukjin Kwon edited comment on SPARK-15393 at 5/20/16 5:36 AM:
---

[~jurriaanpruis]
Hm.. I am trying to reproduce these exceptions.

I added a test in {{ParquetHadoopFsRelationSuite}} as below and ran it
before/after my PR:

{code}
  test("SPARK-15393: create empty file") {
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  withTempPath { path =>
val schema = StructType(
  StructField("k", StringType, true) ::
  StructField("v", IntegerType, false) :: Nil)
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
schema)
emptyDf.write
  .format("parquet")
  .save(path.getCanonicalPath)

val copyEmptyDf = spark.read
  .format("parquet")
  .load(path.getCanonicalPath)

copyEmptyDf.show()
  }
}
  }
{code}

I could reproduce the exceptions when reading, but so far I could not reproduce
the exceptions when writing, either before or after the PR. (I ran it more than
10 times both before and after my PR.)

It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be
the cause of these exceptions.

Do you mind sharing the code you ran?


was (Author: hyukjin.kwon):
Hm.. I am trying to reproduce these exceptions.

I added a test in {{ParquetHadoopFsRelationSuite}} as below and ran it
before/after my PR:

{code}
  test("SPARK-15393: create empty file") {
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
  withTempPath { path =>
val schema = StructType(
  StructField("k", StringType, true) ::
  StructField("v", IntegerType, false) :: Nil)
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
schema)
emptyDf.write
  .format("parquet")
  .save(path.getCanonicalPath)

val copyEmptyDf = spark.read
  .format("parquet")
  .load(path.getCanonicalPath)

copyEmptyDf.show()
  }
}
  }
{code}

I could reproduce the exceptions when reading, but so far I could not reproduce
the exceptions when writing, either before or after the PR. (I ran it more than
10 times both before and after my PR.)

It seems https://github.com/apache/spark/pull/12855 (SPARK-10216) might not be
the cause of these exceptions.

Do you mind sharing the code you ran?

> Writing empty Dataframes doesn't save any _metadata files
> -
>
> Key: SPARK-15393
> URL: https://issues.apache.org/jira/browse/SPARK-15393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Priority: Critical
>
> Writing empty DataFrames is broken on the latest master.
> It omits the _metadata files and sometimes throws the following exception (when
> saving as Parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: 
> org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary 
> file for file:/some/test/file
> java.lang.NullPointerException
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>

[jira] [Updated] (SPARK-15057) Remove stale TODO comment for making `enum` in GraphGenerators

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15057:

Fix Version/s: 2.0.0

> Remove stale TODO comment for making `enum` in GraphGenerators
> --
>
> Key: SPARK-15057
> URL: https://issues.apache.org/jira/browse/SPARK-15057
> Project: Spark
>  Issue Type: Task
>  Components: GraphX
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 2.0.0
>
>
> This PR removes a stale TODO comment in GraphGenerators.scala






[jira] [Updated] (SPARK-15057) Remove stale TODO comment for making `enum` in GraphGenerators

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15057:

Fix Version/s: (was: 2.1.0)

> Remove stale TODO comment for making `enum` in GraphGenerators
> --
>
> Key: SPARK-15057
> URL: https://issues.apache.org/jira/browse/SPARK-15057
> Project: Spark
>  Issue Type: Task
>  Components: GraphX
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 2.0.0
>
>
> This PR removes a stale TODO comment in GraphGenerators.scala






[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14261:

Assignee: Oleg Danilov

> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xiaochun Liang
>Assignee: Oleg Danilov
> Fix For: 1.6.2, 2.0.0
>
> Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, 
> 8892_4g_objects.PNG, 8892_5g_objects.PNG, 8892_6g_objects.PNG, 
> 8892_6g_stop_longrunquery_objects.PNG, 8892_MemorySnapshot.PNG, 
> MemorySnapshot.PNG
>
>
> I am running the Spark Thrift Server on Windows Server 2012. The Spark Thrift
> Server is launched in YARN client mode. Its memory usage increases gradually
> as queries come in. I suspect there is a memory leak in the Spark Thrift
> Server.






[jira] [Resolved] (SPARK-14261) Memory leak in Spark Thrift Server

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14261.
-
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.2

> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xiaochun Liang
> Fix For: 1.6.2, 2.0.0
>
> Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, 
> 8892_4g_objects.PNG, 8892_5g_objects.PNG, 8892_6g_objects.PNG, 
> 8892_6g_stop_longrunquery_objects.PNG, 8892_MemorySnapshot.PNG, 
> MemorySnapshot.PNG
>
>
> I am running the Spark Thrift Server on Windows Server 2012. The Spark Thrift
> Server is launched in YARN client mode. Its memory usage increases gradually
> as queries come in. I suspect there is a memory leak in the Spark Thrift
> Server.






[jira] [Assigned] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15433:


Assignee: Apache Spark

> PySpark core test should not use SerDe from PythonMLLibAPI
> --
>
> Key: SPARK-15433
> URL: https://issues.apache.org/jira/browse/SPARK-15433
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> Currently the PySpark core tests use the SerDe from PythonMLLibAPI, which pulls
> in many MLlib dependencies. They should use SerDeUtil instead.






[jira] [Commented] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292739#comment-15292739
 ] 

Apache Spark commented on SPARK-15433:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/13214

> PySpark core test should not use SerDe from PythonMLLibAPI
> --
>
> Key: SPARK-15433
> URL: https://issues.apache.org/jira/browse/SPARK-15433
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Currently the PySpark core tests use the SerDe from PythonMLLibAPI, which pulls
> in many MLlib dependencies. They should use SerDeUtil instead.






[jira] [Assigned] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15433:


Assignee: (was: Apache Spark)

> PySpark core test should not use SerDe from PythonMLLibAPI
> --
>
> Key: SPARK-15433
> URL: https://issues.apache.org/jira/browse/SPARK-15433
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Currently the PySpark core tests use the SerDe from PythonMLLibAPI, which pulls
> in many MLlib dependencies. They should use SerDeUtil instead.






[jira] [Issue Comment Deleted] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI

2016-05-19 Thread Miao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miao Wang updated SPARK-15433:
--
Comment: was deleted

(was: [~viirya] If you are not working on this one, I would like to take it.

Thanks!)

> PySpark core test should not use SerDe from PythonMLLibAPI
> --
>
> Key: SPARK-15433
> URL: https://issues.apache.org/jira/browse/SPARK-15433
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Currently the PySpark core tests use the SerDe from PythonMLLibAPI, which pulls
> in many MLlib dependencies. They should use SerDeUtil instead.






[jira] [Commented] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI

2016-05-19 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292738#comment-15292738
 ] 

Miao Wang commented on SPARK-15433:
---

[~viirya] If you are not working on this one, I would like to take it.

Thanks!

> PySpark core test should not use SerDe from PythonMLLibAPI
> --
>
> Key: SPARK-15433
> URL: https://issues.apache.org/jira/browse/SPARK-15433
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> Currently the PySpark core tests use the SerDe from PythonMLLibAPI, which pulls
> in many MLlib dependencies. They should use SerDeUtil instead.






[jira] [Created] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI

2016-05-19 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-15433:
---

 Summary: PySpark core test should not use SerDe from PythonMLLibAPI
 Key: SPARK-15433
 URL: https://issues.apache.org/jira/browse/SPARK-15433
 Project: Spark
  Issue Type: Test
  Components: PySpark
Reporter: Liang-Chi Hsieh
Priority: Minor


Currently the PySpark core tests use the SerDe from PythonMLLibAPI, which pulls
in many MLlib dependencies. They should use SerDeUtil instead.






[jira] [Resolved] (SPARK-14990) nvl, coalesce, array functions with parameter of type "array"

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14990.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 2.0.0

> nvl, coalesce, array functions with parameter of type "array"
> -
>
> Key: SPARK-14990
> URL: https://issues.apache.org/jira/browse/SPARK-14990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oleg Danilov
>Assignee: Reynold Xin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Steps to reproduce:
> 1. create table tmp(col1 int, col2 array<int>)
> 2. insert values:
> {code}
>   1, [1]
>   2, [2]
>   3, NULL
> {code}
> 3. run query 
> select col1, coalesce(col2, array(1,2,3)) from tmp;
> Expected result:
> {code}
> 1, [1]
> 2, [2]
> 3, [1,2,3]
> {code}
> Current result:
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 
> 'coalesce(col2,array(1,2,3))' due to data type mismatch: input to function 
> coalesce should all be the same type, but it's [array, array]; line 
> 1 pos 38 (state=,code=0)
> {code}
> The fix seems to be pretty easy, will create a PR soon.
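The same call can also be reproduced from the DataFrame API. A minimal, hedged
sketch (assuming a SparkSession named {{spark}} with {{spark.implicits._}}
imported):

{code}
// Hedged reproduction sketch: coalesce over an array column with an array
// literal as the fallback, mirroring the SQL query above.
import org.apache.spark.sql.functions.{array, coalesce, col, lit}
import spark.implicits._

val tmp = Seq((1, Seq(1)), (2, Seq(2)), (3, null: Seq[Int])).toDF("col1", "col2")
tmp.select(col("col1"), coalesce(col("col2"), array(lit(1), lit(2), lit(3)))).show()
{code}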






[jira] [Assigned] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15363:


Assignee: (was: Apache Spark)

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>
> In SPARK-14615, we use VectorImplicits._ and asML in the example code to
> minimize the changes in that PR. However, these are private APIs, which
> shouldn't appear in the example code. We should consider updating them during
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala
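For reference, the examples can construct ml vectors directly with the public
{{org.apache.spark.ml.linalg}} API instead of converting from mllib types via
the private implicits. A hedged sketch:

{code}
// Hedged sketch: build ml vectors with the public API rather than going
// through org.apache.spark.mllib.linalg and asML/VectorImplicits._.
import org.apache.spark.ml.linalg.{Vector, Vectors}

val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
{code}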






[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292723#comment-15292723
 ] 

Apache Spark commented on SPARK-15363:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/13213

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>
> In SPARK-14615, we use VectorImplicits._ and asML in the example code to
> minimize the changes in that PR. However, these are private APIs, which
> shouldn't appear in the example code. We should consider updating them during
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala






[jira] [Assigned] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15363:


Assignee: Apache Spark

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> In SPARK-14615, we use VectorImplicits._ and asML in the example code to
> minimize the changes in that PR. However, these are private APIs, which
> shouldn't appear in the example code. We should consider updating them during
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala






[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-19 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292719#comment-15292719
 ] 

Maciej Bryński commented on SPARK-15345:


Will try to test it with [~m1lan] today.



> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that Spark doesn't find any databases specified in the configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark
> 1.6 and launching the above snippet, I can print out the existing databases.
> When run in DEBUG mode, this is what Spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> 

[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292717#comment-15292717
 ] 

Reynold Xin commented on SPARK-15345:
-

Haven't tested it :)

I am a little bit busy later today. Do you want to take a look at the Python 
part?

Also cc [~andrewor14] who might be able to work on the Python part.


> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that Spark doesn't find any databases specified in the configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark
> 1.6 and launching the above snippet, I can print out the existing databases.
> When run in DEBUG mode, this is what Spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no 

[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-19 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292715#comment-15292715
 ] 

Maciej Bryński commented on SPARK-15345:


[~rxin] Are you planning another PR for Python, as in your comment?
"I updated Python docs. The Python change seems slightly larger and since it is
not user facing, I'm going to defer it to another pr."

Or should I assume that the Python part is working?

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that Spark doesn't find any databases specified in the configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark
> 1.6 and launching the above snippet, I can print out the existing databases.
> When run in DEBUG mode, this is what Spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by 

[jira] [Resolved] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15345.
-
   Resolution: Fixed
 Assignee: Reynold Xin
Fix Version/s: 2.0.0

I think I have fixed it in https://github.com/apache/spark/pull/13200. If there 
is still a problem, please reopen. Thanks.
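For illustration, a hedged sketch of the behaviour this fix is about: an option
passed to the builder should be visible in the session's conf even when a
SparkContext already exists ({{spark.some.option}} is a placeholder name, not
taken from this issue):

{code}
// Hedged sketch: the builder option should show up in spark.conf even if a
// SparkContext was created earlier in the same JVM.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.some.option", "value")  // placeholder key/value
  .getOrCreate()

println(spark.conf.get("spark.some.option"))
{code}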


> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 2.0.0
>
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that Spark doesn't find any databases specified in the configuration.
> Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark
> 1.6 and launching the above snippet, I can print out the existing databases.
> When run in DEBUG mode, this is what Spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing 

[jira] [Resolved] (SPARK-15075) Cleanup dependencies between SQLContext and SparkSession

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15075.
-
Resolution: Fixed
  Assignee: Reynold Xin

> Cleanup dependencies between SQLContext and SparkSession
> 
>
> Key: SPARK-15075
> URL: https://issues.apache.org/jira/browse/SPARK-15075
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> We currently use SQLContext.getOrCreate in SparkSession.Builder. It should
> probably be the other way around, i.e. all the core logic goes in
> SparkSession, and SQLContext just calls into it.
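In other words, user code would go through SparkSession directly, and
SQLContext would remain a thin wrapper over it. A hedged usage sketch of that
direction with the 2.0 public API:

{code}
// Hedged sketch of the intended layering: SparkSession owns the core
// getOrCreate logic; a SQLContext is obtained from the session, not the other
// way around.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("session-first")  // illustrative app name
  .getOrCreate()

val sqlContext = spark.sqlContext
{code}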






[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292706#comment-15292706
 ] 

Xiao Li commented on SPARK-15396:
-

Based on your description, it sounds like this is caused by the parameter you
used. If you can fix the issue by setting `spark.sql.warehouse.dir`, I think
this is a documentation issue. I will submit a PR soon. Thanks!
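For reference, a minimal hedged sketch of pointing Spark 2.0 at an explicit
warehouse location while keeping Hive support enabled (the path is a
placeholder, not taken from this report); the same key can also be set in
spark-defaults.conf:

{code}
// Hedged sketch: set spark.sql.warehouse.dir explicitly when building the
// session; "/user/hive/warehouse" is only an illustrative path.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()
{code}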

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e), but I ran across an issue
> where it always connects to the local Derby database and can't connect to my
> existing Hive metastore database. Could you help me check what the root cause is?
> What is the specific configuration for integration with the Hive metastore in
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is an existing Hive metastore database named "test_sparksql". I always
> get the error "metastore.ObjectStore: Failed to get database test_sparksql,
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please
> see the steps below for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does 

[jira] [Updated] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs

2016-05-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-11827:

Assignee: kevin yu

> Support java.math.BigInteger in Type-Inference utilities for POJOs
> --
>
> Key: SPARK-11827
> URL: https://issues.apache.org/jira/browse/SPARK-11827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Abhilash Srimat Tirumala Pallerlamudi
>Assignee: kevin yu
>Priority: Minor
> Fix For: 2.0.0
>
>
> I get the below exception when creating a DataFrame from an RDD of JavaBeans 
> that have a property of type java.math.BigInteger:
> scala.MatchError: class java.math.BigInteger (of class java.lang.Class)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447)
> I don't see support for java.math.BigInteger in 
> org.apache.spark.sql.catalyst.JavaTypeInference.scala. 
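A minimal reproduction sketch, assuming a hypothetical bean class (the names below are illustrative and not taken from the report); it exercises the same bean-based schema inference path shown in the stack trace:

{code}
// Hypothetical bean; bean-class schema inference has no case for BigInteger.
import scala.beans.BeanProperty

class Account extends java.io.Serializable {
  @BeanProperty var id: java.math.BigInteger = java.math.BigInteger.ZERO
}

val rdd = sc.parallelize(Seq(new Account))
// On affected versions this fails with scala.MatchError during schema inference.
val df = sqlContext.createDataFrame(rdd, classOf[Account])
{code}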



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs

2016-05-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-11827.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10125
[https://github.com/apache/spark/pull/10125]

> Support java.math.BigInteger in Type-Inference utilities for POJOs
> --
>
> Key: SPARK-11827
> URL: https://issues.apache.org/jira/browse/SPARK-11827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Abhilash Srimat Tirumala Pallerlamudi
>Priority: Minor
> Fix For: 2.0.0
>
>
> I get the below exception when creating a DataFrame from an RDD of JavaBeans 
> that have a property of type java.math.BigInteger:
> scala.MatchError: class java.math.BigInteger (of class java.lang.Class)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447)
> I don't see support for java.math.BigInteger in 
> org.apache.spark.sql.catalyst.JavaTypeInference.scala. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292701#comment-15292701
 ] 

Apache Spark commented on SPARK-15431:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/13212

> Support LIST FILE(s)|JAR(s) command natively
> 
>
> Key: SPARK-15431
> URL: https://issues.apache.org/jira/browse/SPARK-15431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> Currently command "ADD FILE|JAR " is supported natively in  
> SparkSQL. However, when this command is run, the file/jar is added to the 
> resources that can not be looked up by "LIST FILE(s)|JAR(s)" command because 
> the LIST command is passed to Hive command processor in Spark-SQL or simply 
> not supported in Spark-shell. There is no way users can find out what 
> files/jars are added to the spark context. 
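A usage sketch of what native support might look like (the jar path is a placeholder, and the LIST syntax shown is the one proposed here, so the final form may differ):

{code}
// Illustrative only; /tmp/my-udfs.jar is a placeholder path.
sqlContext.sql("ADD JAR /tmp/my-udfs.jar")

// Proposed native commands to inspect the resources added to the session.
sqlContext.sql("LIST JARS").show(false)
sqlContext.sql("LIST FILES").show(false)
{code}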



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15431:


Assignee: Apache Spark

> Support LIST FILE(s)|JAR(s) command natively
> 
>
> Key: SPARK-15431
> URL: https://issues.apache.org/jira/browse/SPARK-15431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>Assignee: Apache Spark
>
> Currently command "ADD FILE|JAR " is supported natively in  
> SparkSQL. However, when this command is run, the file/jar is added to the 
> resources that can not be looked up by "LIST FILE(s)|JAR(s)" command because 
> the LIST command is passed to Hive command processor in Spark-SQL or simply 
> not supported in Spark-shell. There is no way users can find out what 
> files/jars are added to the spark context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15431:


Assignee: (was: Apache Spark)

> Support LIST FILE(s)|JAR(s) command natively
> 
>
> Key: SPARK-15431
> URL: https://issues.apache.org/jira/browse/SPARK-15431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xin Wu
>
> Currently command "ADD FILE|JAR " is supported natively in  
> SparkSQL. However, when this command is run, the file/jar is added to the 
> resources that can not be looked up by "LIST FILE(s)|JAR(s)" command because 
> the LIST command is passed to Hive command processor in Spark-SQL or simply 
> not supported in Spark-shell. There is no way users can find out what 
> files/jars are added to the spark context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-19 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292689#comment-15292689
 ] 

Miao Wang commented on SPARK-15363:
---

[~mengxr] Yanbo pointed me to https://github.com/apache/spark/pull/13202. Now I 
understand why that PR uses the public API in the example code. 
Thanks!

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>
> In SPARK-14615, we use VectorImplicits._ and asML in example code to 
> minimize the changes in that PR. However, these are private APIs, which 
> shouldn't appear in the example code. We should consider updating them during 
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15432) Two executors with same id in Spark UI

2016-05-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15432:
--

 Summary: Two executors with same id in Spark UI
 Key: SPARK-15432
 URL: https://issues.apache.org/jira/browse/SPARK-15432
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Davies Liu


Both of them are dead.

{code}
56  10.0.245.96:50929   Dead0   0.0 B / 15.3 GB 0.0 B   4   
0   0   0   0   0 ms (0 ms) 0.0 B   0.0 B   0.0 B   
stdout
stderr
56  10.0.245.96:50929   Dead0   0.0 B / 15.3 GB 0.0 B   4   
0   0   0   0   0 ms (0 ms) 0.0 B   0.0 B   0.0 B   
stdout
stderr

{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15321) Encoding/decoding of Array[Timestamp] fails

2016-05-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15321.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13108
[https://github.com/apache/spark/pull/13108]

> Encoding/decoding of Array[Timestamp] fails
> ---
>
> Key: SPARK-15321
> URL: https://issues.apache.org/jira/browse/SPARK-15321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sumedh Mungee
>Assignee: Sumedh Mungee
> Fix For: 2.0.0
>
>
> In {{ExpressionEncoderSuite}}, if you add the following test case:
> {code}
> encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of 
> timestamp") 
> {code}
> ... you will see that (without this fix) it fails with the following output:
> {code}
> - encode/decode for array of timestamp: [Ljava.sql.Timestamp;@fd9ebde *** 
> FAILED ***
>   Exception thrown while decoding
>   Converted: [0,100010,80001,52a7ccdc36800]
>   Schema: value#61615
>   root
>   -- value: array (nullable = true)
>   |-- element: timestamp (containsNull = true)
>   Encoder:
>   class[value[0]: array] (ExpressionEncoderSuite.scala:312)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15431) Support LIST FILE(s)|JAR(s) command natively

2016-05-19 Thread Xin Wu (JIRA)
Xin Wu created SPARK-15431:
--

 Summary: Support LIST FILE(s)|JAR(s) command natively
 Key: SPARK-15431
 URL: https://issues.apache.org/jira/browse/SPARK-15431
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xin Wu


Currently command "ADD FILE|JAR " is supported natively in  
SparkSQL. However, when this command is run, the file/jar is added to the 
resources that can not be looked up by "LIST FILE(s)|JAR(s)" command because 
the LIST command is passed to Hive command processor in Spark-SQL or simply not 
supported in Spark-shell. There is no way users can find out what files/jars 
are added to the spark context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15430) Access ListAccumulator's value could possibly cause java.util.ConcurrentModificationException

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15430:


Assignee: Apache Spark

> Access ListAccumulator's value could possibly cause 
> java.util.ConcurrentModificationException
> -
>
> Key: SPARK-15430
> URL: https://issues.apache.org/jira/browse/SPARK-15430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> In ListAccumulator we create an unmodifiable view of the underlying list. 
> However, this doesn't prevent the underlying list from being modified further. 
> So while we access the unmodifiable view, the underlying list can be modified 
> at the same time. This could cause java.util.ConcurrentModificationException. 
> We should fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15430) Access ListAccumulator's value could possibly cause java.util.ConcurrentModificationException

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292682#comment-15292682
 ] 

Apache Spark commented on SPARK-15430:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/13211

> Access ListAccumulator's value could possibly cause 
> java.util.ConcurrentModificationException
> -
>
> Key: SPARK-15430
> URL: https://issues.apache.org/jira/browse/SPARK-15430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> In ListAccumulator we create an unmodifiable view of the underlying list. 
> However, this doesn't prevent the underlying list from being modified further. 
> So while we access the unmodifiable view, the underlying list can be modified 
> at the same time. This could cause java.util.ConcurrentModificationException. 
> We should fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15430) Access ListAccumulator's value could possibly cause java.util.ConcurrentModificationException

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15430:


Assignee: (was: Apache Spark)

> Access ListAccumulator's value could possibly cause 
> java.util.ConcurrentModificationException
> -
>
> Key: SPARK-15430
> URL: https://issues.apache.org/jira/browse/SPARK-15430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> In ListAccumulator we create an unmodifiable view of the underlying list. 
> However, this doesn't prevent the underlying list from being modified further. 
> So while we access the unmodifiable view, the underlying list can be modified 
> at the same time. This could cause java.util.ConcurrentModificationException. 
> We should fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15430) Access ListAccumulator's value could possibly cause java.util.ConcurrentModificationException

2016-05-19 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-15430:
---

 Summary: Access ListAccumulator's value could possibly cause 
java.util.ConcurrentModificationException
 Key: SPARK-15430
 URL: https://issues.apache.org/jira/browse/SPARK-15430
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


In ListAccumulator we create an unmodifiable view of the underlying list. However, 
this doesn't prevent the underlying list from being modified further. So while we 
access the unmodifiable view, the underlying list can be modified at the same time. 
This could cause java.util.ConcurrentModificationException. We should fix it.
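A minimal sketch of the hazard with hypothetical names (not Spark's actual accumulator code): an unmodifiable wrapper still reflects writes to the backing list, so iterating it while another thread appends may throw, whereas handing out a snapshot copy avoids that.

{code}
// Illustrative only; names are hypothetical, not the real ListAccumulator.
import java.util.{ArrayList, Collections, List => JList}

class ListAcc[T] {
  private val buf = new ArrayList[T]()

  def add(v: T): Unit = synchronized { buf.add(v) }

  // Hazard: the view shares the backing list, so a concurrent add() while a
  // caller iterates can throw java.util.ConcurrentModificationException.
  def valueView: JList[T] = Collections.unmodifiableList(buf)

  // Safer: return a snapshot copy instead of a live view.
  def valueCopy: JList[T] = synchronized { new ArrayList[T](buf) }
}
{code}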




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15429) When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate the receiving rate accurately.

2016-05-19 Thread Albert Cheng (JIRA)
Albert Cheng created SPARK-15429:


 Summary: When `spark.streaming.concurrentJobs > 1`, 
PIDRateEstimator cannot estimate the receiving rate accurately.
 Key: SPARK-15429
 URL: https://issues.apache.org/jira/browse/SPARK-15429
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.6.1
Reporter: Albert Cheng


When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate the 
receiving rate accurately.

For example, if the batch duration is set to 10 seconds, each RDD in the 
DStream will take 20s to compute. After setting 
`spark.streaming.concurrentJobs=2`, each RDD in the DStream still takes 20s to 
consume the data, which leads to a poor backpressure estimate from 
PIDRateEstimator.
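For context, a hedged sketch of the configuration being described (the app name is a placeholder; this only reproduces the setup and is not a fix):

{code}
// Illustrative configuration matching the scenario above.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("backpressure-demo")                     // placeholder name
  .set("spark.streaming.concurrentJobs", "2")          // two concurrent jobs
  .set("spark.streaming.backpressure.enabled", "true") // enables PIDRateEstimator

// 10-second batches, while each batch reportedly takes ~20s to process.
val ssc = new StreamingContext(conf, Seconds(10))
{code}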




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15308) RowEncoder should preserve nested column name.

2016-05-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15308:
-
Target Version/s: 2.0.0

> RowEncoder should preserve nested column name.
> --
>
> Key: SPARK-15308
> URL: https://issues.apache.org/jira/browse/SPARK-15308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> The following code generates wrong schema:
> {code}
> val schema = new StructType().add(
>   "struct",
>   new StructType()
> .add("i", IntegerType, nullable = false)
> .add(
>   "s",
>   new StructType().add("int", IntegerType, nullable = false),
>   nullable = false),
>   nullable = false)
> val ds = sqlContext.range(10).map(l => Row(l, Row(l)))(RowEncoder(schema))
> ds.printSchema()
> {code}
> This should print as follows:
> {code}
> root
>  |-- struct: struct (nullable = false)
>  ||-- i: integer (nullable = false)
>  ||-- s: struct (nullable = false)
>  |||-- int: integer (nullable = false)
> {code}
> but the result is:
> {code}
> root
>  |-- struct: struct (nullable = false)
>  ||-- col1: integer (nullable = false)
>  ||-- col2: struct (nullable = false)
>  |||-- col1: integer (nullable = false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15313) EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject.

2016-05-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15313:
-
Target Version/s: 2.0.0

> EmbedSerializerInFilter rule should keep exprIds of output of surrounded 
> SerializeFromObject.
> -
>
> Key: SPARK-15313
> URL: https://issues.apache.org/jira/browse/SPARK-15313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> The following code:
> {code}
> val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
> ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
> {code}
> throws an Exception:
> {noformat}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _1#420
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:254)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>  at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>  at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>  at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:79)
>  at 
> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:194)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>  at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>  at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>  at 
> org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:113)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:79)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>  

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-19 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292672#comment-15292672
 ] 

Yi Zhou commented on SPARK-15396:
-

It seems it is not only a doc issue; it may be a functional issue.

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e), but I ran across an issue 
> where it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check the root cause? 
> What is the specific configuration for integrating with the Hive metastore in 
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is an existing Hive metastore database named "test_sparksql". I always 
> get the error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see the steps below for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged 

[jira] [Commented] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-19 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292655#comment-15292655
 ] 

Yi Zhou commented on SPARK-15396:
-

Hi [~rxin]
I saw the bug fix https://issues.apache.org/jira/browse/SPARK-15345. Does it also 
fix the issue reported in this JIRA?

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e), but I ran across an issue 
> where it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check the root cause? 
> What is the specific configuration for integrating with the Hive metastore in 
> Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is an existing Hive metastore database named "test_sparksql". I always 
> get the error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see the steps below for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> 

[jira] [Resolved] (SPARK-15296) Refactor All Java Tests that use SparkSession

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15296.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13101
[https://github.com/apache/spark/pull/13101]

> Refactor All Java Tests that use SparkSession
> -
>
> Key: SPARK-15296
> URL: https://issues.apache.org/jira/browse/SPARK-15296
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, Tests
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Minor
> Fix For: 2.0.0
>
>
> There's a lot of duplicate code in Java tests: the {{setUp()}} and {{tearDown()}} 
> methods of most Java test classes in ML/MLlib look the same.
> So we will create a {{SharedSparkSession}} class that holds the common code for 
> {{setUp}} and {{tearDown}}, and the other classes just extend that class.
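A sketch of the shared-fixture idea, written in Scala for consistency with the other snippets in this thread (the class name is hypothetical; the JIRA itself targets the Java test suites):

{code}
// Hypothetical base class; concrete suites extend it instead of repeating setUp/tearDown.
import org.apache.spark.sql.SparkSession
import org.junit.{After, Before}

abstract class SharedSparkSessionBase {
  protected var spark: SparkSession = _

  @Before def setUp(): Unit = {
    spark = SparkSession.builder()
      .master("local[2]")
      .appName(getClass.getSimpleName)
      .getOrCreate()
  }

  @After def tearDown(): Unit = {
    if (spark != null) {
      spark.stop()
      spark = null
    }
  }
}
{code}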



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15428) Disable support for multiple streaming aggregations

2016-05-19 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-15428:
--
Target Version/s: 2.0.0

> Disable support for multiple streaming aggregations
> ---
>
> Key: SPARK-15428
> URL: https://issues.apache.org/jira/browse/SPARK-15428
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Incrementalizing plans with multiple streaming aggregations is tricky, and 
> we don't have the necessary "delta" support to implement this correctly. So 
> we are disabling support for multiple streaming aggregations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15428) Disable support for multiple streaming aggregations

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15428:


Assignee: Apache Spark  (was: Tathagata Das)

> Disable support for multiple streaming aggregations
> ---
>
> Key: SPARK-15428
> URL: https://issues.apache.org/jira/browse/SPARK-15428
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Incrementalizing plans with multiple streaming aggregations is tricky, and 
> we don't have the necessary "delta" support to implement this correctly. So 
> we are disabling support for multiple streaming aggregations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15428) Disable support for multiple streaming aggregations

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292629#comment-15292629
 ] 

Apache Spark commented on SPARK-15428:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13210

> Disable support for multiple streaming aggregations
> ---
>
> Key: SPARK-15428
> URL: https://issues.apache.org/jira/browse/SPARK-15428
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Incrementalizing plans with multiple streaming aggregations is tricky, and 
> we don't have the necessary "delta" support to implement this correctly. So 
> we are disabling support for multiple streaming aggregations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15428) Disable support for multiple streaming aggregations

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15428:


Assignee: Tathagata Das  (was: Apache Spark)

> Disable support for multiple streaming aggregations
> ---
>
> Key: SPARK-15428
> URL: https://issues.apache.org/jira/browse/SPARK-15428
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Incrementalizing plans with multiple streaming aggregations is tricky, and 
> we don't have the necessary "delta" support to implement this correctly. So 
> we are disabling support for multiple streaming aggregations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15428) Disable support for multiple streaming aggregations

2016-05-19 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-15428:
-

 Summary: Disable support for multiple streaming aggregations
 Key: SPARK-15428
 URL: https://issues.apache.org/jira/browse/SPARK-15428
 Project: Spark
  Issue Type: Sub-task
Reporter: Tathagata Das
Assignee: Tathagata Das


Incrementalizing plans with multiple streaming aggregations is tricky, and we 
don't have the necessary "delta" support to implement this correctly. So we are 
disabling support for multiple streaming aggregations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15427) Spark SQL doesn't support field case sensitive when load data use Phoenix

2016-05-19 Thread deng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deng updated SPARK-15427:
-
Description: 
I use Spark SQL to load data from Apache Phoenix.

SQLContext sqlContext = new SQLContext(sc);

 Map<String, String> options = new HashMap<String, String>();
 options.put("driver", driver);
 options.put("url", PhoenixUtil.p.getProperty("phoenixURL"));
 options.put("dbtable", "(select \"value\",\"name\" from \"user\")");
 DataFrame jdbcDF = sqlContext.load("jdbc", options);

It always throws an exception like "can't find field VALUE". 
I tracked the code and found that Spark uses:
  val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE 
1=0").executeQuery()
to get the fields. But the field name has already been uppercased, e.g. "value" 
to VALUE, so it always throws "can't find field VALUE".

It doesn't handle the case where data is loaded from a source in which fields 
are case sensitive. 


  was:
i am use sparkSql load data from Apache Phoenix.

SQLContext sqlContext = new SQLContext(sc);

 Map options = new HashMap();
 options.put("driver", driver);
 options.put("url", PhoenixUtil.p.getProperty("phoenixURL"));
  options.put("dbtable", "(select "value","name" from "user")");
  DataFrame jdbcDF = sqlContext.load("jdbc", options);

It will always throws exception, like "can't find field VALUE". 
I track the code and find  spark will use:
  val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE 
1=0").executeQuery()
to get the field.But the field already be uppercase like "value" to VALUE. So 
it will always throws "can't find field VALUE";

It didn't think of the the case when data loaded from source in which filed is 
case sensitive. 



> Spark SQL doesn't support field case sensitive when load data use Phoenix
> -
>
> Key: SPARK-15427
> URL: https://issues.apache.org/jira/browse/SPARK-15427
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: deng
>  Labels: easyfix, features, newbie
>
> I use Spark SQL to load data from Apache Phoenix.
> SQLContext sqlContext = new SQLContext(sc);
>  Map<String, String> options = new HashMap<String, String>();
>  options.put("driver", driver);
>  options.put("url", PhoenixUtil.p.getProperty("phoenixURL"));
>  options.put("dbtable", "(select \"value\",\"name\" from \"user\")");
>  DataFrame jdbcDF = sqlContext.load("jdbc", options);
> It always throws an exception like "can't find field VALUE". 
> I tracked the code and found that Spark uses:
>   val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE 
> 1=0").executeQuery()
> to get the fields. But the field name has already been uppercased, e.g. "value" 
> to VALUE, so it always throws "can't find field VALUE".
> It doesn't handle the case where data is loaded from a source in which fields 
> are case sensitive. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15427) Spark SQL doesn't support field case sensitive when load data use Phoenix

2016-05-19 Thread deng (JIRA)
deng created SPARK-15427:


 Summary: Spark SQL doesn't support field case sensitive when load 
data use Phoenix
 Key: SPARK-15427
 URL: https://issues.apache.org/jira/browse/SPARK-15427
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.5.0
Reporter: deng


I use Spark SQL to load data from Apache Phoenix.

SQLContext sqlContext = new SQLContext(sc);

 Map<String, String> options = new HashMap<String, String>();
 options.put("driver", driver);
 options.put("url", PhoenixUtil.p.getProperty("phoenixURL"));
 options.put("dbtable", "(select \"value\",\"name\" from \"user\")");
 DataFrame jdbcDF = sqlContext.load("jdbc", options);

It always throws an exception like "can't find field VALUE". 
I tracked the code and found that Spark uses:
  val rs = conn.prepareStatement(s"SELECT * FROM $table WHERE 
1=0").executeQuery()
to get the fields. But the field name has already been uppercased, e.g. "value" 
to VALUE, so it always throws "can't find field VALUE".

It doesn't handle the case where data is loaded from a source in which fields 
are case sensitive. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2016-05-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292599#comment-15292599
 ] 

Nicholas Chammas commented on SPARK-3821:
-

You can deploy Spark today on Docker just fine. It's just that Spark itself 
does not maintain any official Dockerfiles and likely never will since the 
project is actually trying to push deployment stuff outside the main project 
(hence why spark-ec2 was moved out; you will not see spark-ec2 in the official 
docs once Spark 2.0 comes out). You may be more interested in the Apache 
Bigtop project, which focuses on big data system deployment (including Spark) and 
may have stuff for Docker specifically. 

Mesos is a separate matter, because it's a resource manager (analogous to YARN) 
that integrates with Spark at a low level.

If you still think Spark should host and maintain an official Dockerfile and 
Docker images that are suitable for production use, please open a separate 
issue. I think the maintainers will reject it on the grounds that I have 
explained here, though. (Can't say for sure; after all I'm just a random 
contributor.)

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15425) Disallow cartesian joins by default

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15425:


Assignee: Apache Spark

> Disallow cartesian joins by default
> ---
>
> Key: SPARK-15425
> URL: https://issues.apache.org/jira/browse/SPARK-15425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> It is fairly easy for users to shoot themselves in the foot if they run 
> cartesian joins. Often they might not even be aware of the join methods 
> chosen. This happened to me a few times in the last few weeks.
> It would be a good idea to disable cartesian joins by default, and require 
> explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
> however might be too large of a scope for 2.0 given the timing. As a small 
> and quick fix, we can just have a single config option 
> (spark.sql.join.enableCartesian) that controls this behavior. In the future 
> we can implement the fine-grained control.
> Note that the error message should be friendly and say "Set 
> spark.sql.join.enableCartesian to true to turn on cartesian joins."
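A hedged sketch of the user-facing behavior being proposed (the config key is the one named above and may change before release; the DataFrames are placeholders):

{code}
// Illustrative only; the config key follows the proposal in this issue.
val facts = sqlContext.range(1000000).toDF("a")
val dims  = sqlContext.range(100).toDF("b")

// With the guard on by default, a join without any join condition would fail fast...
val cross = facts.join(dims)

// ...unless the user explicitly opts in, e.g.:
sqlContext.sql("SET spark.sql.join.enableCartesian=true")
{code}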



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15425) Disallow cartesian joins by default

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292558#comment-15292558
 ] 

Apache Spark commented on SPARK-15425:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/13209

> Disallow cartesian joins by default
> ---
>
> Key: SPARK-15425
> URL: https://issues.apache.org/jira/browse/SPARK-15425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> It is fairly easy for users to shoot themselves in the foot if they run 
> cartesian joins. Often they might not even be aware of the join methods 
> chosen. This happened to me a few times in the last few weeks.
> It would be a good idea to disable cartesian joins by default, and require 
> explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
> however might be too large of a scope for 2.0 given the timing. As a small 
> and quick fix, we can just have a single config option 
> (spark.sql.join.enableCartesian) that controls this behavior. In the future 
> we can implement the fine-grained control.
> Note that the error message should be friendly and say "Set 
> spark.sql.join.enableCartesian to true to turn on cartesian joins."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15425) Disallow cartesian joins by default

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15425:


Assignee: (was: Apache Spark)

> Disallow cartesian joins by default
> ---
>
> Key: SPARK-15425
> URL: https://issues.apache.org/jira/browse/SPARK-15425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> It is fairly easy for users to shoot themselves in the foot if they run 
> cartesian joins. Often they might not even be aware of the join methods 
> chosen. This happened to me a few times in the last few weeks.
> It would be a good idea to disable cartesian joins by default, and require 
> explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
> however might be too large of a scope for 2.0 given the timing. As a small 
> and quick fix, we can just have a single config option 
> (spark.sql.join.enableCartesian) that controls this behavior. In the future 
> we can implement the fine-grained control.
> Note that the error message should be friendly and say "Set 
> spark.sql.join.enableCartesian to true to turn on cartesian joins."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15420) Repartition and sort before Parquet writes

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15420:

Target Version/s: 2.1.0

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.
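As a hedged illustration of the manual workaround described above (df, the column names, and the output path are placeholders), the extra repartition-and-sort step today looks roughly like this:

{code}
// Illustrative only; df, "dt", "key", and the path are placeholders.
import org.apache.spark.sql.functions.col

df.repartition(col("dt"))            // co-locate rows of each output partition
  .sortWithinPartitions("dt", "key") // so each open Parquet file is written in order
  .write
  .partitionBy("dt")
  .parquet("/tmp/output")
{code}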



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14543) SQL/Hive insertInto has unexpected results

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14543:

Target Version/s: 2.0.0

> SQL/Hive insertInto has unexpected results
> --
>
> Key: SPARK-14543
> URL: https://issues.apache.org/jira/browse/SPARK-14543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>
> The Hive write path adds a pre-insertion cast (projection) to reconcile 
> incoming data columns with the outgoing table schema. Columns are matched by 
> position and casts are inserted to reconcile the two column schemas.
> When columns aren't correctly aligned, this causes unexpected results. I ran 
> into this by not using a correct {{partitionBy}} call (addressed by 
> SPARK-14459), which caused an error message that an int could not be cast to 
> an array. However, if the columns are vaguely compatible, for example string 
> and float, then no error or warning is produced and data is written to the 
> wrong columns using unexpected casts (string -> bigint -> float).
> A real-world use case that will hit this is when a table definition changes 
> by adding a column in the middle of a table. Spark SQL statements that copied 
> from that table to a destination table will then map the columns differently 
> but insert casts that mask the problem. The last column's data will be 
> dropped without a reliable warning for the user.
> This highlights a few problems:
> * Too many or too few incoming data columns should cause an AnalysisException 
> to be thrown
> * Only "safe" casts should be inserted automatically, like int -> long, using 
> UpCast
> * Pre-insertion casts currently ignore extra columns by using zip
> * The pre-insertion cast logic differs between Hive's MetastoreRelation and 
> LogicalRelation
> Also, I think there should be an option to match input data to output columns 
> by name. The API allows operations on tables, which hide the column 
> resolution problem. It's easy to copy from one table to another without 
> listing the columns, and in the API it is common to work with columns by name 
> rather than by position. I think the API should add a way to match columns by 
> name, which is closer to what users expect. I propose adding something like 
> this:
> {code}
> CREATE TABLE src (id: bigint, count: int, total: bigint)
> CREATE TABLE dst (id: bigint, total: bigint, count: int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
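
The {{byName}} API above is a proposal. For comparison, a rough sketch of the workaround 
available with the existing positional API, reordering the source columns to match the 
destination schema before inserting (assumes the {{src}} and {{dst}} tables from the proposal 
already exist):

{code}
import org.apache.spark.sql.functions.col

// Workaround with the existing positional API: reorder the source columns to
// match the destination table's schema before inserting.
val dstCols = sqlContext.table("dst").columns.map(col)
sqlContext.table("src")
  .select(dstCols: _*)
  .write
  .insertInto("dst")
{code}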



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15296) Refactor All Java Tests that use SparkSession

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15296:

Assignee: Sandeep Singh
Target Version/s: 2.0.0

> Refactor All Java Tests that use SparkSession
> -
>
> Key: SPARK-15296
> URL: https://issues.apache.org/jira/browse/SPARK-15296
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, Tests
>Reporter: Sandeep Singh
>Assignee: Sandeep Singh
>Priority: Minor
>
> There is a lot of duplicate code in the Java tests: the {{setUp()}} and 
> {{tearDown()}} methods of most Java test classes in ML and MLlib are nearly 
> identical.
> We will create a {{SharedSparkSession}} class that holds the common {{setUp}} 
> and {{tearDown}} code, and the other test classes will simply extend it.
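
Conceptually (sketched here in Scala with hypothetical names; the actual change targets the 
Java test classes), the shared class would own the session lifecycle roughly like this:

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the shared base class: it holds the SparkSession
// lifecycle so individual suites no longer repeat setUp()/tearDown().
abstract class SharedSparkSessionBase {
  protected var spark: SparkSession = _

  def setUp(): Unit = {
    spark = SparkSession.builder()
      .master("local[2]")
      .appName(getClass.getSimpleName)
      .getOrCreate()
  }

  def tearDown(): Unit = {
    if (spark != null) {
      spark.stop()
      spark = null
    }
  }
}
{code}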



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14261:

Target Version/s: 2.0.0

> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xiaochun Liang
> Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, 
> 8892_4g_objects.PNG, 8892_5g_objects.PNG, 8892_6g_objects.PNG, 
> 8892_6g_stop_longrunquery_objects.PNG, 8892_MemorySnapshot.PNG, 
> MemorySnapshot.PNG
>
>
> I am running the Spark Thrift server on Windows Server 2012. The Spark Thrift 
> server is launched in YARN client mode. Its memory usage increases gradually 
> as queries come in, so I suspect there is a memory leak in the Spark Thrift 
> server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14261) Memory leak in Spark Thrift Server

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14261:

Target Version/s: 1.6.2, 2.0.0  (was: 2.0.0)

> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-14261
> URL: https://issues.apache.org/jira/browse/SPARK-14261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xiaochun Liang
> Attachments: 16716_heapdump_64g.PNG, 16716_heapdump_80g.PNG, 
> 8892_4g_objects.PNG, 8892_5g_objects.PNG, 8892_6g_objects.PNG, 
> 8892_6g_stop_longrunquery_objects.PNG, 8892_MemorySnapshot.PNG, 
> MemorySnapshot.PNG
>
>
> I am running the Spark Thrift server on Windows Server 2012. The Spark Thrift 
> server is launched in YARN client mode. Its memory usage increases gradually 
> as queries come in, so I suspect there is a memory leak in the Spark Thrift 
> server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15321) Encoding/decoding of Array[Timestamp] fails

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15321:

 Assignee: Sumedh Mungee
Affects Version/s: (was: 2.0.0)
 Target Version/s: 2.0.0
  Component/s: (was: Spark Core)
   SQL

> Encoding/decoding of Array[Timestamp] fails
> ---
>
> Key: SPARK-15321
> URL: https://issues.apache.org/jira/browse/SPARK-15321
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sumedh Mungee
>Assignee: Sumedh Mungee
>
> In {{ExpressionEncoderSuite}}, if you add the following test case:
> {code}
> encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of 
> timestamp") 
> {code}
> ... you will see that (without this fix) it fails with the following output:
> {code}
> - encode/decode for array of timestamp: [Ljava.sql.Timestamp;@fd9ebde *** 
> FAILED ***
>   Exception thrown while decoding
>   Converted: [0,100010,80001,52a7ccdc36800]
>   Schema: value#61615
>   root
>   -- value: array (nullable = true)
>   |-- element: timestamp (containsNull = true)
>   Encoder:
>   class[value[0]: array] (ExpressionEncoderSuite.scala:312)
> {code}
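
A minimal way to exercise the same encode/decode path outside the suite is sketched below; 
note that {{ExpressionEncoder}} is an internal API and is used here only to mirror what the 
test does, and the data is hypothetical.

{code}
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

val spark = SparkSession.builder().master("local[2]").appName("array-ts-sketch").getOrCreate()

// Build the expression encoder for Array[Timestamp] explicitly, as the suite does.
val enc = ExpressionEncoder[Array[Timestamp]]()
val ds = spark.createDataset(Seq(Array(Timestamp.valueOf("2016-01-29 10:00:00"))))(enc)

// Before the fix, collecting (decoding) the array of timestamps fails.
ds.collect().foreach(arr => println(arr.mkString(", ")))
{code}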



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14990) nvl, coalesce, array functions with parameter of type "array"

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292524#comment-15292524
 ] 

Apache Spark commented on SPARK-14990:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13208

> nvl, coalesce, array functions with parameter of type "array"
> -
>
> Key: SPARK-14990
> URL: https://issues.apache.org/jira/browse/SPARK-14990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oleg Danilov
>Priority: Minor
>
> Steps to reproduce:
> 1. create table tmp(col1 int, col2 array)
> 2. insert values:
> {code}
>   1, [1]
>   2, [2]
>   3, NULL
> {code}
> 3. run query 
> select col1, coalesce(col2, array(1,2,3)) from tmp;
> Expected result:
> {code}
> 1, [1]
> 2, [2]
> 3, [1,2,3]
> {code}
> Current result:
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 
> 'coalesce(col2,array(1,2,3))' due to data type mismatch: input to function 
> coalesce should all be the same type, but it's [array, array]; line 
> 1 pos 38 (state=,code=0)
> {code}
> The fix seems to be pretty easy, will create a PR soon.
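
For readers who want to try the query shape outside Hive, a rough sketch using a hypothetical 
in-memory table (the original report runs against a Hive table, and the array element types 
were lost in the formatting above):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("coalesce-array-sketch").getOrCreate()
import spark.implicits._

// Hypothetical in-memory stand-in for the Hive table in the report.
val tmp = Seq((1, Seq(1)), (2, Seq(2)), (3, null.asInstanceOf[Seq[Int]])).toDF("col1", "col2")
tmp.createOrReplaceTempView("tmp")

// The reported query shape: coalesce over an array column and an array literal.
// The expected result replaces the NULL array with [1,2,3].
spark.sql("SELECT col1, coalesce(col2, array(1, 2, 3)) FROM tmp").show()
{code}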



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15416) Display a better message for not finding classes removed in Spark 2.0

2016-05-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-15416:
-
Assignee: Shixiong Zhu

> Display a better message for not finding classes removed in Spark 2.0
> -
>
> Key: SPARK-15416
> URL: https://issues.apache.org/jira/browse/SPARK-15416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> We removed some classes in Spark 2.0. If a user pulls in an incompatible 
> library, they may see a ClassNotFoundException. It would be better to show an 
> instruction that points them to a compatible version of the library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15416) Display a better message for not finding classes removed in Spark 2.0

2016-05-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15416.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13201
[https://github.com/apache/spark/pull/13201]

> Display a better message for not finding classes removed in Spark 2.0
> -
>
> Key: SPARK-15416
> URL: https://issues.apache.org/jira/browse/SPARK-15416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
> Fix For: 2.0.0
>
>
> We removed some classes in Spark 2.0. If a user pulls in an incompatible 
> library, they may see a ClassNotFoundException. It would be better to show an 
> instruction that points them to a compatible version of the library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12978:

Target Version/s: 2.0.0

> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This ticket targets the optimization to skip an unnecessary group-by 
> operation below;
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}
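
A sketch (hypothetical data; column names taken from the plans above) of a query that 
produces this kind of two-phase aggregation plan:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

val spark = SparkSession.builder().master("local[2]").appName("groupby-plan-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data matching the column names in the plans above.
val df = Seq(("a", 1, 1.0), ("a", 2, 2.0), ("b", 3, 3.0)).toDF("col0", "col1", "col2").cache()

// The aggregation whose physical plan is shown above. If df were already
// hash-partitioned by col0 (e.g. via a prior repartition on col0), the partial
// aggregate and the exchange in the unoptimized plan would be redundant.
df.groupBy("col0").agg(sum("col1"), avg("col2")).explain()
{code}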



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15425) Disallow cartesian joins by default

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15425:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-15426

> Disallow cartesian joins by default
> ---
>
> Key: SPARK-15425
> URL: https://issues.apache.org/jira/browse/SPARK-15425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> It is fairly easy for users to shoot themselves in the foot if they run 
> cartesian joins. Often they might not even be aware of the join methods 
> chosen. This happened to me a few times in the last few weeks.
> It would be a good idea to disable cartesian joins by default, and require 
> explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
> however might be too large of a scope for 2.0 given the timing. As a small 
> and quick fix, we can just have a single config option 
> (spark.sql.join.enableCartesian) that controls this behavior. In the future 
> we can implement the fine-grained control.
> Note that the error message should be friendly and say "Set 
> spark.sql.join.enableCartesian to true to turn on cartesian joins."
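
A sketch of how the proposal would look from the user side; note that 
{{spark.sql.join.enableCartesian}} is only the name proposed in this ticket, not an existing 
configuration.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("cartesian-sketch").getOrCreate()

// "spark.sql.join.enableCartesian" is the name proposed in this ticket; it is
// not (yet) an existing Spark configuration option.
spark.conf.set("spark.sql.join.enableCartesian", "true")

// A join with no condition, i.e. the cartesian product the proposal would
// reject unless explicitly enabled.
val left = spark.range(3).toDF("a")
val right = spark.range(3).toDF("b")
left.join(right).show()
{code}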



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15426) Spark 2.0 SQL API audit

2016-05-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15426:
---

 Summary: Spark 2.0 SQL API audit
 Key: SPARK-15426
 URL: https://issues.apache.org/jira/browse/SPARK-15426
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


This is an umbrella ticket to list issues I found with APIs for the 2.0 release.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15425) Disallow cartesian joins by default

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15425:

Description: 
It is fairly easy for users to shoot themselves in the foot if they run 
cartesian joins. Often they might not even be aware of the join methods chosen. 
This happened to me a few times in the last few weeks.

It would be a good idea to disable cartesian joins by default, and require 
explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
however might be too large of a scope for 2.0 given the timing. As a small and 
quick fix, we can just have a single config option 
(spark.sql.join.enableCartesian) that controls this behavior. In the future we 
can implement the fine-grained control.

Note that the error message should be friendly and say "Set 
spark.sql.join.enableCartesian to true to turn on cartesian joins."




  was:
It is fairly easy for users to shoot themselves in the foot if they run 
cartesian joins. Often they might not even be aware of the join methods chosen. 
This happened to me a few times in the last few weeks.

It would be a good idea to disable cartesian joins by default, and require 
explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
however might be too large of a scope for 2.0 given the timing. As a small and 
quick fix, we can just have a single config option 
(spark.sql.join.enableCartesian) that controls this behavior. In the future we 
can implement the fine-grained control.





> Disallow cartesian joins by default
> ---
>
> Key: SPARK-15425
> URL: https://issues.apache.org/jira/browse/SPARK-15425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> It is fairly easy for users to shoot themselves in the foot if they run 
> cartesian joins. Often they might not even be aware of the join methods 
> chosen. This happened to me a few times in the last few weeks.
> It would be a good idea to disable cartesian joins by default, and require 
> explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
> however might be too large of a scope for 2.0 given the timing. As a small 
> and quick fix, we can just have a single config option 
> (spark.sql.join.enableCartesian) that controls this behavior. In the future 
> we can implement the fine-grained control.
> Note that the error message should be friendly and say "Set 
> spark.sql.join.enableCartesian to true to turn on cartesian joins."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15425) Disallow cartesian joins by default

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15425:

Target Version/s: 2.0.0

> Disallow cartesian joins by default
> ---
>
> Key: SPARK-15425
> URL: https://issues.apache.org/jira/browse/SPARK-15425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> It is fairly easy for users to shoot themselves in the foot if they run 
> cartesian joins. Often they might not even be aware of the join methods 
> chosen. This happened to me a few times in the last few weeks.
> It would be a good idea to disable cartesian joins by default, and require 
> explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
> however might be too large of a scope for 2.0 given the timing. As a small 
> and quick fix, we can just have a single config option 
> (spark.sql.join.enableCartesian) that controls this behavior. In the future 
> we can implement the fine-grained control.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15425) Disallow cartesian joins by default

2016-05-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15425:
---

 Summary: Disallow cartesian joins by default
 Key: SPARK-15425
 URL: https://issues.apache.org/jira/browse/SPARK-15425
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


It is fairly easy for users to shoot themselves in the foot if they run 
cartesian joins. Often they might not even be aware of the join methods chosen. 
This happened to me a few times in the last few weeks.

It would be a good idea to disable cartesian joins by default, and require 
explicit enabling of it via "crossJoin" method or in SQL "cross join". This 
however might be too large of a scope for 2.0 given the timing. As a small and 
quick fix, we can just have a single config option 
(spark.sql.join.enableCartesian) that controls this behavior. In the future we 
can implement the fine-grained control.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15329) When start spark with yarn: spark.SparkContext: Error initializing SparkContext.

2016-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15329.
---
Resolution: Not A Problem

>  When start spark with yarn: spark.SparkContext: Error initializing 
> SparkContext. 
> --
>
> Key: SPARK-15329
> URL: https://issues.apache.org/jira/browse/SPARK-15329
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Jon
>
> Hi, I'm trying to start Spark with yarn-client, like this: "spark-shell 
> --master yarn-client", but I'm getting the error below.
> If I start Spark just with "spark-shell", everything works fine.
> I have a single-node machine where all Hadoop processes are running, and a 
> Hive metastore server is running.
> I have already tried more than 30 different configurations, but nothing works; 
> the config that I have now is this:
> core-site.xml:
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://masternode:9000</value>
>   </property>
> </configuration>
> hdfs-site.xml:
> <configuration>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
> </configuration>
> yarn-site.xml:
> <configuration>
>   <property>
>     <name>yarn.resourcemanager.resource-tracker.address</name>
>     <value>masternode:8031</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.address</name>
>     <value>masternode:8032</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.scheduler.address</name>
>     <value>masternode:8030</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.admin.address</name>
>     <value>masternode:8033</value>
>   </property>
>   <property>
>     <name>yarn.resourcemanager.webapp.address</name>
>     <value>masternode:8088</value>
>   </property>
> </configuration>
> About spark confs:
> spark-env.sh:
> HADOOP_CONF_DIR=/usr/local/hadoop-2.7.1/hadoop
> SPARK_MASTER_IP=masternode
> spark-defaults.conf
> spark.master spark://masternode:7077
> spark.serializer org.apache.spark.serializer.KryoSerializer
> Do you understand why this is happening?
> hadoopadmin@mn:~$ spark-shell --master yarn-client
> 16/05/14 23:21:07 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 16/05/14 23:21:07 INFO spark.SecurityManager: Changing view acls to: 
> hadoopadmin
> 16/05/14 23:21:07 INFO spark.SecurityManager: Changing modify acls to: 
> hadoopadmin
> 16/05/14 23:21:07 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hadoopadmin); 
> users with modify permissions: Set(hadoopadmin)
> 16/05/14 23:21:08 INFO spark.HttpServer: Starting HTTP Server
> 16/05/14 23:21:08 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/05/14 23:21:08 INFO server.AbstractConnector: Started 
> SocketConnector@0.0.0.0:36979
> 16/05/14 23:21:08 INFO util.Utils: Successfully started service 'HTTP class 
> server' on port 36979.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 16/05/14 23:21:12 INFO spark.SparkContext: Running Spark version 1.6.1
> 16/05/14 23:21:12 INFO spark.SecurityManager: Changing view acls to: 
> hadoopadmin
> 16/05/14 23:21:12 INFO spark.SecurityManager: Changing modify acls to: 
> hadoopadmin
> 16/05/14 23:21:12 INFO spark.SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(hadoopadmin); 
> users with modify permissions: Set(hadoopadmin)
> 16/05/14 23:21:12 INFO util.Utils: Successfully started service 'sparkDriver' 
> on port 33128.
> 16/05/14 23:21:13 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/05/14 23:21:13 INFO Remoting: Starting remoting
> 16/05/14 23:21:13 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriverActorSystem@10.15.0.11:34382]
> 16/05/14 23:21:13 INFO util.Utils: Successfully started service 
> 'sparkDriverActorSystem' on port 34382.
> 16/05/14 23:21:13 INFO spark.SparkEnv: Registering MapOutputTracker
> 16/05/14 23:21:13 INFO spark.SparkEnv: Registering BlockManagerMaster
> 16/05/14 23:21:13 INFO storage.DiskBlockManager: Created local directory at 
> /tmp/blockmgr-a0048199-bf2f-404b-9cd2-b5988367783f
> 16/05/14 23:21:13 INFO storage.MemoryStore: MemoryStore started with capacity 
> 511.1 MB
> 16/05/14 23:21:13 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 16/05/14 23:21:13 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/05/14 23:21:13 INFO server.AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0:4040
> 16/05/14 23:21:13 INFO util.Utils: Successfully started service 'SparkUI' on 
> port 4040.
> 16/05/14 23:21:13 INFO ui.SparkUI: Started SparkUI at http://10.15.0.11:4040
> 16/05/14 23:21:14 INFO client.RMProxy: Connecting to ResourceManager at 
> localhost/127.0.0.1:8032
> 16/05/14 23:21:14 INFO yarn.Client: Requesting a new application from cluster 
> with 1 NodeManagers
> 16/05/14 23:21:14 INFO yarn.Client: Verifying our application has not 
> requested 

[jira] [Commented] (SPARK-15403) LinearRegressionWithSGD fails on files more than 12Mb data

2016-05-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292500#comment-15292500
 ] 

Sean Owen commented on SPARK-15403:
---

This looks like some problem in your Spark environment, like mismatched or 
conflicting builds of Spark. ClassNotFoundException wouldn't be something to do 
with data.

> LinearRegressionWithSGD fails on files more than 12Mb data 
> ---
>
> Key: SPARK-15403
> URL: https://issues.apache.org/jira/browse/SPARK-15403
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Ubuntu 14.04 with 8 Gb Ram, scala 2.11.7 with following 
> memory settings for my project:  JAVA_OPTS="-Xmx8G -Xms2G" .
>Reporter: Ana La
>Priority: Blocker
>
> I parse my JSON-like data using the DataFrame and Spark SQL facilities, then 
> scale one numerical feature and create dummy variables for the categorical 
> features. From the initial 14 keys of my JSON-like file I end up with about 
> 200-240 features in the final LabeledPoint. The final data is sparse, and 
> every file contains a minimum of 5 observations. I try to run two kinds of 
> algorithms on the data, LinearRegressionWithSGD or LassoWithSGD, since the 
> data is sparse and regularization might be required. For data larger than 
> 11MB, LinearRegressionWithSGD fails with the following error:
> {quote} org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 58 in stage 346.0 failed 1 times, most recent failure: Lost task 58.0 in 
> stage 346.0 (TID 18140, localhost): ExecutorLostFailure (executor driver 
> exited caused by one of the running tasks) Reason: Executor heartbeat timed 
> out after 179307 ms {quote}
> I tried to reproduce this bug in a smaller example, and I suspect that 
> something is wrong with LinearRegressionWithSGD on large data sets. 
> I noticed that the StandardScaler preprocessing step and the counts in the 
> linear regression step call collect(), which could cause the problem. This 
> puts the scalability of linear regression in question because, as far as I 
> understand it, collect() runs on the driver, so the point of distributed 
> computation is lost. 
> {code:scala}
> import java.io.{File}
> import org.apache.spark.mllib.linalg.{Vectors}
> import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.{SQLContext}
> import scala.language.postfixOps
> object Main2 {
>   def main(args: Array[String]): Unit = {
> // Spark configuration is defined for execution on a local computer, 4 
> cores, 8 GB RAM
> val conf = new SparkConf()
>   .setMaster(s"local[*]")
>   .setAppName("spark_linear_regression_bug_report")
>   //multiple configurations were tried for driver/executor memories, 
> including default configurations
>   .set("spark.driver.memory", "3g")
>   .set("spark.executor.memory", "3g")
>   .set("spark.executor.heartbeatInterval", "30s")
> // Spark context and SQL context definitions
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> val countFeatures = 500
> val countList = 50
> val features = sc.broadcast(1 to countFeatures)
> val rdd: RDD[LabeledPoint] = sc.range(1, countList).map { i =>
>   LabeledPoint(
> label = i.toDouble,
> features = Vectors.dense(features.value.map(_ => 
> scala.util.Random.nextInt(2).toDouble).toArray)
>   )
> }.persist()
> val numIterations = 1000
> val stepSize = 0.3
> val algorithm = new LassoWithSGD() //LassoWithSGD() 
> algorithm.setIntercept(true)
> algorithm.optimizer
>   .setNumIterations(numIterations)
>   .setStepSize(stepSize)
> val model = algorithm.run(rdd)
>   }
> }
> {code}
> the complete Error of the bug :
> {quote}  [info] Running Main 
> WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> WARN  org.apache.spark.util.Utils - Your hostname, julien-ubuntu resolves to 
> a loopback address: 127.0.1.1; using 192.168.0.49 instead (on interface wlan0)
> WARN  org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to 
> another address
> INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
> INFO  Remoting - Starting remoting
> INFO  Remoting - Remoting started; listening on addresses 
> :[akka.tcp://sparkDriverActorSystem@192.168.0.49:59897]
> INFO  org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT
> INFO  org.spark-project.jetty.server.AbstractConnector - Started 
> SelectChannelConnector@0.0.0.0:4040
> WARN  com.github.fommil.netlib.BLAS - Failed to load 

[jira] [Commented] (SPARK-14989) Upgrade to Jackson 2.7.3

2016-05-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292497#comment-15292497
 ] 

Sean Owen commented on SPARK-14989:
---

See https://github.com/apache/spark/pull/9759 by the way.
I think the problem is using something from 2.7 but then executing in an 
environment that provides 2.4 or 2.5.

> Upgrade to Jackson 2.7.3
> 
>
> Key: SPARK-14989
> URL: https://issues.apache.org/jira/browse/SPARK-14989
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> For Spark 2.0, we should upgrade to a newer version of Jackson (2.7.3).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15404) pyspark sql bug ,here is the testcase

2016-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15404.
---
Resolution: Invalid

Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> pyspark sql bug ,here is the testcase
> -
>
> Key: SPARK-15404
> URL: https://issues.apache.org/jira/browse/SPARK-15404
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: 郭同
>
> import os
> import sys
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from pyspark.sql.types import Row, StructField, StructType, StringType, 
> IntegerType
> if __name__ == "__main__":
>     sc = SparkContext(appName="PythonSQL")
>     sqlContext = SQLContext(sc)
>     schema = StructType([StructField("person_name", StringType(), False),
>                          StructField("person_age", IntegerType(), False)])
>     some_rdd = sc.parallelize([Row(person_name="John", person_age=19),
>                                Row(person_name="Smith", person_age=23),
>                                Row(person_name="Sarah", person_age=18)])
>     some_df = sqlContext.createDataFrame(some_rdd, schema)
>     some_df.printSchema()
>     some_df.registerAsTable("people")
>     teenagers = sqlContext.sql("SELECT * FROM people ")
>     for each in teenagers.collect():
>         print(each)
>     sc.stop()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15423) why it is very slow to clean resources in Spark

2016-05-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15423.
---
  Resolution: Invalid
   Fix Version/s: (was: 2.0.0)
Target Version/s:   (was: 2.0.0)

Questions go to user@. Right now there's no sign it's a Spark problem per se.

Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first.

> why it is very slow to clean resources in Spark
> ---
>
> Key: SPARK-15423
> URL: https://issues.apache.org/jira/browse/SPARK-15423
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, MLlib
>Affects Versions: 2.0.0
> Environment: RedHat 6.5 (64 bit), JDK 1.8, Standalone mode
>Reporter: zszhong
>  Labels: newbie, starter
>
> Hi, everyone! I'm new to Spark. Originally I submitted a post at 
> [http://stackoverflow.com/questions/37331226/why-it-is-very-slow-to-clean-resources-in-spark],
>  but somebody thinks that it is off-topic, so I am posting here to ask for your 
> help. If this post does not belong here, please feel free to delete it. I just 
> copied the content; I don't know how to edit the code to make it more readable, 
> so please refer to the link on Stack Overflow.
> I've submitted a very simple task into a standalone Spark environment 
> (`spark-2.0.0-preview`, `jdk 1.8`, `48 CPU cores`, `250 Gb memory`) with the 
> following command:
> bin/spark-submit.sh --master spark://hostname.domain:7077 --conf 
> "spark.executor.memory=8G" ../SimpleApp.py ../data/train/ ../data/val/
> where the `SimpleApp.py` is:
> from __future__ import print_function
> import sys
> import os
> import numpy as np
> from pyspark import SparkContext 
> from pyspark.mllib.tree import RandomForest, RandomForestModel
> from pyspark.mllib.util import MLUtils 
> trainDataPath = sys.argv[1]
> valDataPath = sys.argv[2]
> sc = SparkContext(appName="Classification using Spark Random Forest")
> trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
> valData = MLUtils.loadLibSVMFile(sc, valDataPath) 
> model = RandomForest.trainClassifier(trainData, numClasses=6, 
> categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", 
> impurity='gini', maxDepth=4, maxBins=32)
> predictions = model.predict(valData.map(lambda x: x.features))
> labelsAndPredictions = valData.map(lambda lp: 
> lp.label).zip(predictions)
> testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() 
> / float(valData.count())
> print('Test Error = ' + str(testErr))
> And the task is running OK and can output the `Test Error` as follows:
> Test Error = 0.380580779161
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 
> 127.0.0.1:59714 in memory (size: 12.1 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 
> 127.0.0.1:37978 in memory (size: 12.1 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 
> 127.0.0.1:37978 in memory (size: 10.9 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 
> 127.0.0.1:59714 in memory (size: 10.9 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
> 127.0.0.1:59714 in memory (size: 4.6 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
> 127.0.0.1:37978 in memory (size: 4.6 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 
> 127.0.0.1:59714 in memory (size: 4.0 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 
> 127.0.0.1:37978 in memory (size: 4.0 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 
> 127.0.0.1:59714 in memory (size: 455.0 B, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 
> 127.0.0.1:37978 in memory (size: 455.0 B, free: 4.5 GB)
> 16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 4
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
> 127.0.0.1:59714 in memory (size: 9.2 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
> 127.0.0.1:37978 in memory (size: 9.2 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 
> 127.0.0.1:59714 in memory (size: 3.6 KB, free: 511.1 MB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 
> 127.0.0.1:37978 in memory (size: 3.6 KB, free: 4.5 GB)
> 16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 
> 127.0.0.1:59714 in memory (size: 389.0 B, free: 511.1 MB)
> 16/05/20 

[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2016-05-19 Thread Mete Kural (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292488#comment-15292488
 ] 

Mete Kural commented on SPARK-3821:
---

Thank you for the response, Nicholas. spark-ec2 does take care of AMIs for EC2 
and is in fact documented in the Spark documentation as a deployment method, 
alongside the Spark distribution. However, the same level of support doesn't seem 
to exist for Docker as a deployment method. What's inside the docker folder in 
Spark is not really in shape for a production deployment, is not documented in 
the Spark documentation either, and doesn't seem to have been worked on in quite a 
while. It seems the only way the Spark project officially supports running 
Spark on Docker is via Mesos; would you say that is correct? With Docker 
becoming an industry standard as of a month ago, I hope there will be renewed 
interest within the Spark project in supporting Docker as an official 
deployment method without the Mesos requirement.

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15341) Add documentation for `model.write` to clarify `summary` was not saved

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15341:
--
Assignee: Yanbo Liang

> Add documentation for `model.write` to clarify `summary` was not saved 
> ---
>
> Key: SPARK-15341
> URL: https://issues.apache.org/jira/browse/SPARK-15341
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, model.write does not save the summary (if applicable). We should 
> add documentation to clarify that.
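
A small sketch of the behavior the documentation should call out, using LogisticRegression as 
an example (hypothetical training data and save path):

{code}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("summary-not-saved").getOrCreate()
import spark.implicits._

// Hypothetical tiny training set.
val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0))
).toDF("label", "features")

val model = new LogisticRegression().fit(training)
println(model.hasSummary)   // true: the summary exists on the freshly fitted model

model.write.overwrite().save("/tmp/lr-model")
val loaded = LogisticRegressionModel.load("/tmp/lr-model")
println(loaded.hasSummary)  // false: the summary is not persisted by model.write
{code}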



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15341) Add documentation for `model.write` to clarify `summary` was not saved

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15341:
--
Fix Version/s: 2.0.0

> Add documentation for `model.write` to clarify `summary` was not saved 
> ---
>
> Key: SPARK-15341
> URL: https://issues.apache.org/jira/browse/SPARK-15341
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, model.write does not save the summary (if applicable). We should 
> add documentation to clarify that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15375) Add ConsoleSink for structure streaming to display the dataframe on the fly

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15375.
-
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.0.0

> Add ConsoleSink for structure streaming to display the dataframe on the fly
> ---
>
> Key: SPARK-15375
> URL: https://issues.apache.org/jira/browse/SPARK-15375
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add a ConsoleSink for Structured Streaming, so users can specify it like any 
> other sink and have the streaming DataFrame displayed on the console.
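
A rough usage sketch, assuming the DataStreamReader/DataStreamWriter API as it ended up in 
Spark 2.0 and a hypothetical socket source; the "console" format is the sink added here.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("console-sink-sketch").getOrCreate()

// Hypothetical socket source; the point is the "console" sink format.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val query = lines.writeStream
  .format("console")       // the sink added by this ticket
  .outputMode("append")
  .start()

query.awaitTermination()
{code}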



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15414) Make the mllib,ml linalg type conversion APIs public

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15414:
--
Assignee: Sandeep Singh

> Make the mllib,ml linalg type conversion APIs public
> 
>
> Key: SPARK-15414
> URL: https://issues.apache.org/jira/browse/SPARK-15414
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Sandeep Singh
> Fix For: 2.0.0
>
>
> We should open up the APIs for converting between new, old linear algebra 
> types (in spark.mllib.linalg):
> * Vector.asML
> * Vectors.fromML
> * same for Sparse/Dense and for Matrices
> I made these private originally, but they will be useful for users 
> transitioning workloads.
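
A small sketch of the round trip these conversion APIs enable once they are public:

{code}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vector => NewVector}

// Old (spark.mllib) vector converted to the new spark.ml type and back, using
// the conversion APIs this ticket opens up.
val oldVec: OldVector = OldVectors.dense(1.0, 2.0, 3.0)
val newVec: NewVector = oldVec.asML
val roundTrip: OldVector = OldVectors.fromML(newVec)
assert(roundTrip == oldVec)
{code}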



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15414) Make the mllib,ml linalg type conversion APIs public

2016-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15414.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13202
[https://github.com/apache/spark/pull/13202]

> Make the mllib,ml linalg type conversion APIs public
> 
>
> Key: SPARK-15414
> URL: https://issues.apache.org/jira/browse/SPARK-15414
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> We should open up the APIs for converting between new, old linear algebra 
> types (in spark.mllib.linalg):
> * Vector.asML
> * Vectors.fromML
> * same for Sparse/Dense and for Matrices
> I made these private originally, but they will be useful for users 
> transitioning workloads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292403#comment-15292403
 ] 

Xiangrui Meng commented on SPARK-15363:
---

No. I think we need to make the converters between the new and old vectors 
public (WIP); then the example code won't need the implicits. Another option is 
to make the implicits public.

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>
> In SPARK-14615, we used VectorImplicits._ and asML in example code to 
> minimize the changes in that PR. However, these are private APIs, which 
> shouldn't appear in the example code. We should consider updating them during 
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15424) Revert SPARK-14807 Create a hivecontext-compatibility module

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292381#comment-15292381
 ] 

Apache Spark commented on SPARK-15424:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13207

> Revert SPARK-14807 Create a hivecontext-compatibility module
> 
>
> Key: SPARK-15424
> URL: https://issues.apache.org/jira/browse/SPARK-15424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> I initially asked to create a hivecontext-compatibility module to put the 
> HiveContext there. But we are so close to the Spark 2.0 release and there is 
> only a single class in it. It seems overkill, and more inconvenient, to have 
> an entire package for just a single class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15424) Revert SPARK-14807 Create a hivecontext-compatibility module

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15424:


Assignee: Reynold Xin  (was: Apache Spark)

> Revert SPARK-14807 Create a hivecontext-compatibility module
> 
>
> Key: SPARK-15424
> URL: https://issues.apache.org/jira/browse/SPARK-15424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> I initially asked to create a hivecontext-compatibility module to put the 
> HiveContext there. But we are so close to the Spark 2.0 release and there is 
> only a single class in it. It seems overkill, and more inconvenient, to have 
> an entire package for just a single class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14807) Create a hivecontext-compatibility module

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292384#comment-15292384
 ] 

Apache Spark commented on SPARK-14807:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13207

> Create a hivecontext-compatibility module
> -
>
> Key: SPARK-14807
> URL: https://issues.apache.org/jira/browse/SPARK-14807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> In 2.0, SparkSession will replace SQLContext/HiveContext. We will move 
> HiveContext to a compatibility module and users can optionally use this 
> module to access HiveContext. 
> This jira is to create this compatibility module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15424) Revert SPARK-14807 Create a hivecontext-compatibility module

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15424:


Assignee: Apache Spark  (was: Reynold Xin)

> Revert SPARK-14807 Create a hivecontext-compatibility module
> 
>
> Key: SPARK-15424
> URL: https://issues.apache.org/jira/browse/SPARK-15424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> I initially asked to create a hivecontext-compatibility module to put the 
> HiveContext there. But we are so close to the Spark 2.0 release and there is 
> only a single class in it. It seems overkill, and more inconvenient, to have 
> an entire package for just a single class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15424) Revert SPARK-14807 Create a hivecontext-compatibility module

2016-05-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15424:
---

 Summary: Revert SPARK-14807 Create a hivecontext-compatibility 
module
 Key: SPARK-15424
 URL: https://issues.apache.org/jira/browse/SPARK-15424
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


I initially asked to create a hivecontext-compatibility module to put the 
HiveContext there. But we are so close to the Spark 2.0 release and there is only 
a single class in it. It seems overkill, and more inconvenient, to have an entire 
package for just a single class.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14807) Create a hivecontext-compatibility module

2016-05-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14807:

Summary: Create a hivecontext-compatibility module  (was: Create a 
compatibility module)

> Create a hivecontext-compatibility module
> -
>
> Key: SPARK-14807
> URL: https://issues.apache.org/jira/browse/SPARK-14807
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.0.0
>
>
> In 2.0, SparkSession will replace SQLContext/HiveContext. We will move 
> HiveContext to a compatibility module and users can optionally use this 
> module to access HiveContext. 
> This jira is to create this compatibility module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15423) why it is very slow to clean resources in Spark

2016-05-19 Thread zszhong (JIRA)
zszhong created SPARK-15423:
---

 Summary: why it is very slow to clean resources in Spark
 Key: SPARK-15423
 URL: https://issues.apache.org/jira/browse/SPARK-15423
 Project: Spark
  Issue Type: Question
  Components: Block Manager, MLlib
Affects Versions: 2.0.0
 Environment: RedHat 6.5 (64 bit), JDK 1.8, Standalone mode
Reporter: zszhong
 Fix For: 2.0.0


Hi, everyone! I'm new to Spark. Originally I submitted a post at 
[http://stackoverflow.com/questions/37331226/why-it-is-very-slow-to-clean-resources-in-spark],
 but somebody thinks that it is off-topic, so I am posting here to ask for your 
help. If this post does not belong here, please feel free to delete it. I just 
copied the content; I don't know how to edit the code to make it more readable, 
so please refer to the link on Stack Overflow.

I've submitted a very simple task into a standalone Spark environment 
(`spark-2.0.0-preview`, `jdk 1.8`, `48 CPU cores`, `250 Gb memory`) with the 
following command:
bin/spark-submit.sh --master spark://hostname.domain:7077 --conf 
"spark.executor.memory=8G" ../SimpleApp.py ../data/train/ ../data/val/

where the `SimpleApp.py` is:

from __future__ import print_function
import sys
import os
import numpy as np
from pyspark import SparkContext 
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils 
trainDataPath = sys.argv[1]
valDataPath = sys.argv[2]
sc = SparkContext(appName="Classification using Spark Random Forest")
trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
valData = MLUtils.loadLibSVMFile(sc, valDataPath) 
model = RandomForest.trainClassifier(trainData, numClasses=6, 
categoricalFeaturesInfo={}, numTrees=3, featureSubsetStrategy="auto", 
impurity='gini', maxDepth=4, maxBins=32)
predictions = model.predict(valData.map(lambda x: x.features))
labelsAndPredictions = valData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / 
float(valData.count())
print('Test Error = ' + str(testErr))

And the task is running OK and can output the `Test Error` as follows:

Test Error = 0.380580779161
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 
127.0.0.1:59714 in memory (size: 12.1 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 
127.0.0.1:37978 in memory (size: 12.1 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 
127.0.0.1:37978 in memory (size: 10.9 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_19_piece0 on 
127.0.0.1:59714 in memory (size: 10.9 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
127.0.0.1:59714 in memory (size: 4.6 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 
127.0.0.1:37978 in memory (size: 4.6 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 
127.0.0.1:59714 in memory (size: 4.0 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 
127.0.0.1:37978 in memory (size: 4.0 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 
127.0.0.1:59714 in memory (size: 455.0 B, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 
127.0.0.1:37978 in memory (size: 455.0 B, free: 4.5 GB)
16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 4
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
127.0.0.1:59714 in memory (size: 9.2 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 
127.0.0.1:37978 in memory (size: 9.2 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 
127.0.0.1:59714 in memory (size: 3.6 KB, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 
127.0.0.1:37978 in memory (size: 3.6 KB, free: 4.5 GB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 
127.0.0.1:59714 in memory (size: 389.0 B, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 
127.0.0.1:37978 in memory (size: 389.0 B, free: 4.5 GB)
16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 3
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_12_piece0 on 
127.0.0.1:59714 in memory (size: 345.0 B, free: 511.1 MB)
16/05/20 01:04:52 INFO BlockManagerInfo: Removed broadcast_12_piece0 on 
127.0.0.1:37978 in memory (size: 345.0 B, free: 4.5 GB)
16/05/20 01:04:52 INFO ContextCleaner: Cleaned shuffle 2
16/05/20 01:04:52 INFO BlockManager: Removing RDD 19
16/05/20 01:04:52 INFO ContextCleaner: 

[jira] [Commented] (SPARK-15363) Example code shouldn't use VectorImplicits._, asML/fromML

2016-05-19 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292356#comment-15292356
 ] 

Miao Wang commented on SPARK-15363:
---

[~mengxr] I want to solve this problem by copying 
MultivariateOnlineSummarizer.scala to mllib-local and having it use ml.vectors, 
so that the example doesn't have to import VectorImplicits._. Is that the fix 
you have in mind?
Thanks!

> Example code shouldn't use VectorImplicits._, asML/fromML
> -
>
> Key: SPARK-15363
> URL: https://issues.apache.org/jira/browse/SPARK-15363
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>
> In SPARK-14615, we used VectorImplicits._ and asML in example code to 
> minimize the changes in that PR. However, these are private APIs, which 
> shouldn't appear in the example code. We should consider updating them during 
> QA.
> https://github.com/dbtsai/spark/blob/9d25ebacfb4abf4d80d5f6815fac920d18347799/examples/src/main/scala/org/apache/spark/examples/ml/DataFrameExample.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15419) monotonicallyIncreasingId should use less memory with multiple partitions

2016-05-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-15419.
-
   Resolution: Duplicate
 Assignee: Shixiong Zhu
Fix Version/s: 2.0.0

Confirmed that the patch for [SPARK-15317] fixes this issue, or at least makes 
it much less significant.

> monotonicallyIncreasingId should use less memory with multiple partitions
> -
>
> Key: SPARK-15419
> URL: https://issues.apache.org/jira/browse/SPARK-15419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: branch-2.0, 1 worker
>Reporter: Joseph K. Bradley
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> When monotonicallyIncreasingId is used on a DataFrame with many partitions, 
> it uses a very large amount of memory.
> Consider this code:
> {code}
> import org.apache.spark.sql.functions._
> // JMAP1: run jmap -histo:live [PID]
> val numPartitions = 1000
> val df = spark.range(0, 100, 1, numPartitions).toDF("vtx")
> df.cache().count()
> // JMAP2
> val df2 = df.withColumn("id", monotonicallyIncreasingId())
> df2.cache().count()
> // JMAP3
> df2.select(col("id") + 1).count()
> // JMAP4
> {code}
> Here's how memory usage progresses:
> * JMAP1: This is just for calibration.
> * JMAP2: No significant change from 1.
> * JMAP3: Massive jump: 3048895 Longs, 1039638 Objects, 2007427 Integers, 
> 1002000 org.apache.spark.sql.catalyst.expressions.GenericInternalRow
> ** None of these had significant numbers of instances in JMAP1/2.
> * JMAP4: This doubles the object creation.  I.e., even after caching, it 
> keeps generating new objects on every use.
> When the indexed DataFrame is used repeatedly afterwards, the driver memory 
> usage keeps increasing and eventually blows up in my application.
> I wrote "with multiple partitions" because this issue goes away when 
> numPartitions is small (1 or 2).
> Presumably this memory usage could be reduced.
> Note: I also tested a custom indexing using RDD.zipWithIndex, and it is even 
> worse in terms of object creation (about 2x worse):
> {code}
> import org.apache.spark.sql.{DataFrame, Row}
> import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
>
> // Wrap each row in a struct column and append a 0-based integer id.
> def zipWithUniqueIdFrom0(df: DataFrame): DataFrame = {
>   val sqlContext = df.sqlContext
>   val schema = df.schema
>   val outputSchema = StructType(Seq(
>     StructField("row", schema, false),
>     StructField("id", DataTypes.IntegerType, false)))
>   val rdd = df.rdd.zipWithIndex().map { case (row: Row, id: Long) =>
>     Row(row, id.toInt)
>   }
>   sqlContext.createDataFrame(rdd, outputSchema)
> }
> // val df2 = df.withColumn("id", monotonicallyIncreasingId())
> val df2 = zipWithUniqueIdFrom0(df)
> df2.cache().count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14734) Add conversions between mllib and ml Vector, Matrix types

2016-05-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292340#comment-15292340
 ] 

DB Tsai commented on SPARK-14734:
-

We will deprecate the mllib methods in the following releases, so we would 
like a clear cut between the two APIs. I agree that it's debatable whether to 
make them public or not. My concern is that once they're public, it's hard to 
go back.

> Add conversions between mllib and ml Vector, Matrix types
> -
>
> Key: SPARK-14734
> URL: https://issues.apache.org/jira/browse/SPARK-14734
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> For maintaining wrappers around spark.mllib algorithms in spark.ml, it will 
> be useful to have {{private[spark]}} methods for converting from one linear 
> algebra representation to another.  I am running into this issue in 
> [SPARK-14732].
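
For readers following the thread, here is a minimal hand-rolled sketch of 
what such conversions amount to, written against only public constructors; it 
is not the asML/fromML helpers under discussion, which at this point are 
{{private[spark]}} and may differ in detail.

{code}
// Illustrative only: hand-rolled equivalents of the mllib <-> ml conversions,
// using public constructors on both sides.
import org.apache.spark.mllib.{linalg => mllib}
import org.apache.spark.ml.{linalg => ml}

object VectorConversionSketch {
  /** Copy an old mllib vector into the new ml.linalg representation. */
  def toML(v: mllib.Vector): ml.Vector = v match {
    case d: mllib.DenseVector  => new ml.DenseVector(d.values)
    case s: mllib.SparseVector => new ml.SparseVector(s.size, s.indices, s.values)
  }

  /** Dense matrices share the column-major layout, so values carry over as-is. */
  def toML(m: mllib.DenseMatrix): ml.DenseMatrix =
    new ml.DenseMatrix(m.numRows, m.numCols, m.values, m.isTransposed)

  def main(args: Array[String]): Unit = {
    val v = mllib.Vectors.sparse(4, Array(1, 3), Array(2.0, 5.0))
    println(toML(v))  // prints (4,[1,3],[2.0,5.0]) as an ml.linalg vector
  }
}
{code}

The point of the sketch is only that the two representations store the same 
underlying arrays, so conversion is a cheap copy of references.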



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15420) Repartition and sort before Parquet writes

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292338#comment-15292338
 ] 

Apache Spark commented on SPARK-15420:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/13206

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.
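
As a stopgap, the manual step the description alludes to looks roughly like 
the sketch below (the "dt" partition column and the method name are 
assumptions for illustration): repartition on the partition column and sort 
within partitions before the partitioned write, so each task keeps only a few 
files open at once.

{code}
// A minimal sketch of the manual workaround; column name is an assumption.
import org.apache.spark.sql.DataFrame

object RepartitionBeforeWriteSketch {
  def writePartitioned(df: DataFrame, outputPath: String): Unit = {
    df.repartition(df("dt"))           // co-locate rows with the same dt in one task
      .sortWithinPartitions(df("dt"))  // rows for each dt value arrive contiguously
      .write
      .partitionBy("dt")               // one output directory per dt value
      .parquet(outputPath)
  }
}
{code}

Both {{repartition(Column*)}} and {{sortWithinPartitions}} are ordinary 
DataFrame operations, so this can be done by hand today; the proposal above is 
about detecting sorted input and inserting such steps automatically.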



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15345) SparkSession's conf doesn't take effect when there's already an existing SparkContext

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292339#comment-15292339
 ] 

Apache Spark commented on SPARK-15345:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13200

> SparkSession's conf doesn't take effect when there's already an existing 
> SparkContext
> -
>
> Key: SPARK-15345
> URL: https://issues.apache.org/jira/browse/SPARK-15345
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Priority: Blocker
>
> I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
> -Phive-thriftserver).
> I am trying to access databases using this snippet:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> hc.sql("show databases").collect()
> [Row(result='default')]
> {code}
> This means that Spark doesn't find any of the databases specified in the 
> configuration. Using the same configuration (i.e. hive-site.xml and 
> core-site.xml) in Spark 1.6 and launching the above snippet, I can print out 
> the existing databases.
> When run in DEBUG mode this is what spark (2.0) prints out:
> {code}
> 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
> 16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
> string])) null else input[0, string].toString, 
> StructField(result,StringType,false)), result#2) AS #3]   Project 
> [createexternalrow(if (isnull(result#2)) null else result#2.toString, 
> StructField(result,StringType,false)) AS #3]
>  +- LocalRelation [result#2]  
>   
>  +- LocalRelation [result#2]
> 
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
> org.apache.spark.sql.types.StructType 
> org.apache.spark.sql.Dataset$$anonfun$53.structType$1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
> (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1)
>  +++
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
> org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
> this is the starting closure
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting 
> closure: 0
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
> 16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
> 

[jira] [Assigned] (SPARK-15420) Repartition and sort before Parquet writes

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15420:


Assignee: (was: Apache Spark)

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15420) Repartition and sort before Parquet writes

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15420:


Assignee: Apache Spark

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>Assignee: Apache Spark
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14734) Add conversions between mllib and ml Vector, Matrix types

2016-05-19 Thread Koert Kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292335#comment-15292335
 ] 

Koert Kuipers commented on SPARK-14734:
---

Why does making them public make it harder for users to migrate their
applications?

Currently you make it harder to upgrade existing applications to Spark 2.0.

If you want to move people away from mllib to ml, the way to do so is to
deprecate the mllib methods, not to make life hard by leaving methods out, I
think.



> Add conversions between mllib and ml Vector, Matrix types
> -
>
> Key: SPARK-14734
> URL: https://issues.apache.org/jira/browse/SPARK-14734
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> For maintaining wrappers around spark.mllib algorithms in spark.ml, it will 
> be useful to have {{private[spark]}} methods for converting from one linear 
> algebra representation to another.  I am running into this issue in 
> [SPARK-14732].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15422) Remove unnecessary calculation of stage's parents

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15422:


Assignee: Apache Spark

> Remove unnecessary calculation of stage's parents
> -
>
> Key: SPARK-15422
> URL: https://issues.apache.org/jira/browse/SPARK-15422
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: sharkd tu
>Assignee: Apache Spark
>
> Remove the unnecessary calculation of a stage's parents, because a stage's 
> parents are already set when the stage is constructed.
> See https://github.com/apache/spark/pull/13123



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15422) Remove unnecessary calculation of stage's parents

2016-05-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292331#comment-15292331
 ] 

Apache Spark commented on SPARK-15422:
--

User 'sharkdtu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13123

> Remove unnecessary calculation of stage's parents
> -
>
> Key: SPARK-15422
> URL: https://issues.apache.org/jira/browse/SPARK-15422
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: sharkd tu
>
> Remove the unnecessary calculation of a stage's parents, because a stage's 
> parents are already set when the stage is constructed.
> See https://github.com/apache/spark/pull/13123



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15422) Remove unnecessary calculation of stage's parents

2016-05-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15422:


Assignee: (was: Apache Spark)

> Remove unnecessary calculation of stage's parents
> -
>
> Key: SPARK-15422
> URL: https://issues.apache.org/jira/browse/SPARK-15422
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: sharkd tu
>
> Remove the unnecessary calculation of a stage's parents, because a stage's 
> parents are already set when the stage is constructed.
> See https://github.com/apache/spark/pull/13123



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


