[GitHub] spark issue #13553: [SPARK-15814][SQL] Aggregator can return null result

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13553
  
**[Test build #60356 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60356/consoleFull)** for PR 13553 at commit [`3471199`](https://github.com/apache/spark/commit/34711996028e595a3e42d373a03594404f364802).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13553: [SPARK-15814][SQL] Aggregator can return null result

2016-06-11 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13553
  
retest this please





[GitHub] spark issue #13604: [SPARK-15856][SQL] Revert API breaking changes made in D...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13604
  
**[Test build #60355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60355/consoleFull)** for PR 13604 at commit [`15daf4c`](https://github.com/apache/spark/commit/15daf4cdeada07603f8bfdae45676c26e788b635).





[GitHub] spark issue #13617: [SPARK-10409] [ML] Add Multilayer Perceptron Regression ...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13617
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13617: [SPARK-10409] [ML] Add Multilayer Perceptron Regression ...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13617
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60354/
Test PASSed.





[GitHub] spark issue #13617: [SPARK-10409] [ML] Add Multilayer Perceptron Regression ...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13617
  
**[Test build #60354 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60354/consoleFull)** for PR 13617 at commit [`138fd25`](https://github.com/apache/spark/commit/138fd25775a52c087feac47288c4be6480624085).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13619: [SPARK-15892][ML] Incorrectly merged AFTAggregator with ...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13619
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13619: [SPARK-15892][ML] Incorrectly merged AFTAggregator with ...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13619
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60353/
Test PASSed.





[GitHub] spark issue #13619: [SPARK-15892][ML] Incorrectly merged AFTAggregator with ...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13619
  
**[Test build #60353 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60353/consoleFull)** for PR 13619 at commit [`4447d0a`](https://github.com/apache/spark/commit/4447d0a969229dfe5d6cf1bdfc3c0ac62c1fd53e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work ...

2016-06-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/12601#discussion_r66716164
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala ---
@@ -96,7 +97,16 @@ private[sql] case class JDBCRelation(
 
   override val needConversion: Boolean = false
 
-  override val schema: StructType = JDBCRDD.resolveTable(url, table, properties)
+  override val schema: StructType = {
+    val resolvedSchema = JDBCRDD.resolveTable(url, table, properties)
+    providedSchemaOption match {
+      case Some(providedSchema) =>
+        if (providedSchema.sql.toLowerCase == resolvedSchema.sql.toLowerCase) resolvedSchema
--- End diff --

I guess it would make sense if it does not try to apply the resolved schema 
when the schema is explicitly set like the other data sources.
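The suggestion above can be sketched in isolation. This is an illustrative stand-in (the names `SchemaChoice` and `chooseSchema`, and the use of plain strings instead of `StructType`, are assumptions for the sketch, not Spark's API): an explicitly provided schema is used as-is, without reconciling it against the schema resolved from the database.

```scala
// Hypothetical sketch of the review suggestion: trust a user-supplied
// schema when present, fall back to the resolved one otherwise.
// Plain strings stand in for StructType for illustration only.
object SchemaChoice {
  def chooseSchema(provided: Option[String], resolved: String): String =
    provided.getOrElse(resolved) // no case-insensitive SQL comparison needed
}
```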





[GitHub] spark pull request #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work ...

2016-06-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/12601#discussion_r66716162
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala ---
@@ -19,37 +19,105 @@ package org.apache.spark.sql.execution.datasources.jdbc
 
 import java.util.Properties
 
-import org.apache.spark.sql.SQLContext
-import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}
+import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}
+import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, DataSourceRegister, RelationProvider, SchemaRelationProvider}
+import org.apache.spark.sql.types.StructType
 
-class JdbcRelationProvider extends RelationProvider with DataSourceRegister {
+class JdbcRelationProvider extends CreatableRelationProvider
+  with SchemaRelationProvider with RelationProvider with DataSourceRegister {
 
   override def shortName(): String = "jdbc"
 
-  /** Returns a new base relation with the given parameters. */
   override def createRelation(
       sqlContext: SQLContext,
       parameters: Map[String, String]): BaseRelation = {
-    val jdbcOptions = new JDBCOptions(parameters)
-    if (jdbcOptions.partitionColumn != null
-      && (jdbcOptions.lowerBound == null
-        || jdbcOptions.upperBound == null
-        || jdbcOptions.numPartitions == null)) {
+    createRelation(sqlContext, parameters, null)
+  }
+
+  /** Returns a new base relation with the given parameters. */
+  override def createRelation(
+      sqlContext: SQLContext,
+      parameters: Map[String, String],
+      schema: StructType): BaseRelation = {
+    val url = parameters.getOrElse("url", sys.error("Option 'url' not specified"))
+    val table = parameters.getOrElse("dbtable", sys.error("Option 'dbtable' not specified"))
+    val partitionColumn = parameters.getOrElse("partitionColumn", null)
+    val lowerBound = parameters.getOrElse("lowerBound", null)
+    val upperBound = parameters.getOrElse("upperBound", null)
+    val numPartitions = parameters.getOrElse("numPartitions", null)
+
+    if (partitionColumn != null
+      && (lowerBound == null || upperBound == null || numPartitions == null)) {
       sys.error("Partitioning incompletely specified")
     }
 
-    val partitionInfo = if (jdbcOptions.partitionColumn == null) {
-      null
-    } else {
+    val partitionInfo = if (partitionColumn == null) null
+    else {
       JDBCPartitioningInfo(
-        jdbcOptions.partitionColumn,
-        jdbcOptions.lowerBound.toLong,
-        jdbcOptions.upperBound.toLong,
-        jdbcOptions.numPartitions.toInt)
+        partitionColumn, lowerBound.toLong, upperBound.toLong, numPartitions.toInt)
     }
     val parts = JDBCRelation.columnPartition(partitionInfo)
     val properties = new Properties() // Additional properties that we will pass to getConnection
     parameters.foreach(kv => properties.setProperty(kv._1, kv._2))
-    JDBCRelation(jdbcOptions.url, jdbcOptions.table, parts, properties)(sqlContext.sparkSession)
+    JDBCRelation(url, table, parts, properties, Option(schema))(sqlContext.sparkSession)
+  }
+
+  /*
+   * The following structure applies to this code:
+   *               | tableExists               | !tableExists
+   **********************************************************************************
+   * Ignore        | BaseRelation              | CreateTable, saveTable, BaseRelation
+   * ErrorIfExists | ERROR                     | CreateTable, saveTable, BaseRelation
+   * Overwrite     | DropTable, CreateTable,   | CreateTable, saveTable, BaseRelation
+   *               | saveTable, BaseRelation   |
+   * Append        | saveTable, BaseRelation   | CreateTable, saveTable, BaseRelation
+   */
+  override def createRelation(
+      sqlContext: SQLContext,
+      mode: SaveMode,
+      parameters: Map[String, String],
+      data: DataFrame): BaseRelation = {
+    val url = parameters.getOrElse("url",
+      sys.error("Saving jdbc source requires url to be set." +
+        " (ie. df.option(\"url\", \"ACTUAL_URL\")"))
+    val table = parameters.getOrElse("dbtable", parameters.getOrElse("table",
+      sys.error("Saving jdbc source requires dbtable to be set." +
+        " (ie. df.option(\"dbtable\", \"ACTUAL_DB_TABLE\")")))
+
+    import collection.JavaConverters._
+    val props = new Properties()
+    props.putAll(parameters.asJava)
+    val conn = JdbcUtils.createConnectionFactory(url, props)()
+
+    try {
+      val tableExists = 

[GitHub] spark pull request #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work ...

2016-06-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/12601#discussion_r66716158
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala ---
@@ -19,37 +19,105 @@ package org.apache.spark.sql.execution.datasources.jdbc
 
 import java.util.Properties
 
-import org.apache.spark.sql.SQLContext
-import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}
+import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}
+import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, DataSourceRegister, RelationProvider, SchemaRelationProvider}
+import org.apache.spark.sql.types.StructType
 
-class JdbcRelationProvider extends RelationProvider with DataSourceRegister {
+class JdbcRelationProvider extends CreatableRelationProvider
+  with SchemaRelationProvider with RelationProvider with DataSourceRegister {
 
   override def shortName(): String = "jdbc"
 
-  /** Returns a new base relation with the given parameters. */
   override def createRelation(
       sqlContext: SQLContext,
       parameters: Map[String, String]): BaseRelation = {
-    val jdbcOptions = new JDBCOptions(parameters)
-    if (jdbcOptions.partitionColumn != null
-      && (jdbcOptions.lowerBound == null
-        || jdbcOptions.upperBound == null
-        || jdbcOptions.numPartitions == null)) {
+    createRelation(sqlContext, parameters, null)
+  }
+
+  /** Returns a new base relation with the given parameters. */
+  override def createRelation(
+      sqlContext: SQLContext,
+      parameters: Map[String, String],
+      schema: StructType): BaseRelation = {
+    val url = parameters.getOrElse("url", sys.error("Option 'url' not specified"))
+    val table = parameters.getOrElse("dbtable", sys.error("Option 'dbtable' not specified"))
+    val partitionColumn = parameters.getOrElse("partitionColumn", null)
+    val lowerBound = parameters.getOrElse("lowerBound", null)
+    val upperBound = parameters.getOrElse("upperBound", null)
+    val numPartitions = parameters.getOrElse("numPartitions", null)
--- End diff --

There is a class for those options, `JDBCOptions`. It would be nicer if 
those options are managed in a single place.
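A minimal sketch of what that single place could look like, using the option names that appear in the diff above (this simplified class is an illustration of the idea, not Spark's actual `JDBCOptions` implementation):

```scala
// Hypothetical sketch: parse the JDBC options once in a single class
// instead of scattering parameters.getOrElse calls through createRelation.
class JDBCOptions(parameters: Map[String, String]) {
  // required options: fail fast with a clear message
  val url: String = parameters.getOrElse("url", sys.error("Option 'url' not specified"))
  val table: String = parameters.getOrElse("dbtable", sys.error("Option 'dbtable' not specified"))
  // optional partitioning options (kept as nullable strings, as in the diff)
  val partitionColumn: String = parameters.getOrElse("partitionColumn", null)
  val lowerBound: String = parameters.getOrElse("lowerBound", null)
  val upperBound: String = parameters.getOrElse("upperBound", null)
  val numPartitions: String = parameters.getOrElse("numPartitions", null)
}
```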





[GitHub] spark pull request #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work ...

2016-06-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/12601#discussion_r66716161
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala ---
@@ -19,37 +19,105 @@ package org.apache.spark.sql.execution.datasources.jdbc
 
 import java.util.Properties
 
-import org.apache.spark.sql.SQLContext
-import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}
+import org.apache.spark.sql.{DataFrame, SaveMode, SQLContext}
+import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, DataSourceRegister, RelationProvider, SchemaRelationProvider}
+import org.apache.spark.sql.types.StructType
 
-class JdbcRelationProvider extends RelationProvider with DataSourceRegister {
+class JdbcRelationProvider extends CreatableRelationProvider
+  with SchemaRelationProvider with RelationProvider with DataSourceRegister {
 
   override def shortName(): String = "jdbc"
 
-  /** Returns a new base relation with the given parameters. */
   override def createRelation(
       sqlContext: SQLContext,
       parameters: Map[String, String]): BaseRelation = {
-    val jdbcOptions = new JDBCOptions(parameters)
-    if (jdbcOptions.partitionColumn != null
-      && (jdbcOptions.lowerBound == null
-        || jdbcOptions.upperBound == null
-        || jdbcOptions.numPartitions == null)) {
+    createRelation(sqlContext, parameters, null)
+  }
+
+  /** Returns a new base relation with the given parameters. */
+  override def createRelation(
+      sqlContext: SQLContext,
+      parameters: Map[String, String],
+      schema: StructType): BaseRelation = {
+    val url = parameters.getOrElse("url", sys.error("Option 'url' not specified"))
+    val table = parameters.getOrElse("dbtable", sys.error("Option 'dbtable' not specified"))
+    val partitionColumn = parameters.getOrElse("partitionColumn", null)
+    val lowerBound = parameters.getOrElse("lowerBound", null)
+    val upperBound = parameters.getOrElse("upperBound", null)
+    val numPartitions = parameters.getOrElse("numPartitions", null)
+
+    if (partitionColumn != null
+      && (lowerBound == null || upperBound == null || numPartitions == null)) {
      sys.error("Partitioning incompletely specified")
     }
 
-    val partitionInfo = if (jdbcOptions.partitionColumn == null) {
-      null
-    } else {
+    val partitionInfo = if (partitionColumn == null) null
+    else {
       JDBCPartitioningInfo(
-        jdbcOptions.partitionColumn,
-        jdbcOptions.lowerBound.toLong,
-        jdbcOptions.upperBound.toLong,
-        jdbcOptions.numPartitions.toInt)
+        partitionColumn, lowerBound.toLong, upperBound.toLong, numPartitions.toInt)
     }
     val parts = JDBCRelation.columnPartition(partitionInfo)
     val properties = new Properties() // Additional properties that we will pass to getConnection
     parameters.foreach(kv => properties.setProperty(kv._1, kv._2))
-    JDBCRelation(jdbcOptions.url, jdbcOptions.table, parts, properties)(sqlContext.sparkSession)
+    JDBCRelation(url, table, parts, properties, Option(schema))(sqlContext.sparkSession)
+  }
+
+  /*
+   * The following structure applies to this code:
+   *               | tableExists               | !tableExists
+   **********************************************************************************
+   * Ignore        | BaseRelation              | CreateTable, saveTable, BaseRelation
+   * ErrorIfExists | ERROR                     | CreateTable, saveTable, BaseRelation
+   * Overwrite     | DropTable, CreateTable,   | CreateTable, saveTable, BaseRelation
+   *               | saveTable, BaseRelation   |
+   * Append        | saveTable, BaseRelation   | CreateTable, saveTable, BaseRelation
+   */
+  override def createRelation(
+      sqlContext: SQLContext,
+      mode: SaveMode,
+      parameters: Map[String, String],
+      data: DataFrame): BaseRelation = {
+    val url = parameters.getOrElse("url",
+      sys.error("Saving jdbc source requires url to be set." +
+        " (ie. df.option(\"url\", \"ACTUAL_URL\")"))
+    val table = parameters.getOrElse("dbtable", parameters.getOrElse("table",
+      sys.error("Saving jdbc source requires dbtable to be set." +
+        " (ie. df.option(\"dbtable\", \"ACTUAL_DB_TABLE\")")))
+
+    import collection.JavaConverters._
--- End diff --

I think this can just be imported at the class level rather than re-imported every time it creates a relation.
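A minimal illustration of that suggestion: hoist the `JavaConverters` import to the top of the file so every method can use `asJava` without a local import. (`PropsBuilder` and `toProperties` are illustrative names, not part of the PR.)

```scala
import java.util.Properties
import scala.collection.JavaConverters._ // file-level import, visible everywhere below

object PropsBuilder {
  // Converts a Scala Map into java.util.Properties, as the diff above does
  // inside createRelation; asJava comes from the file-level import.
  def toProperties(parameters: Map[String, String]): Properties = {
    val props = new Properties()
    props.putAll(parameters.asJava)
    props
  }
}
```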



[GitHub] spark issue #13614: Support Stata-like tabulation of values in a single colu...

2016-06-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13614
  
I think it will be nicer if it follows 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark





[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2016-06-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r66716081
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -29,7 +30,10 @@ import org.apache.spark._
 import org.apache.spark.internal.Logging
 import org.apache.spark.util.{RedirectThread, Utils}
 
-private[spark] class PythonWorkerFactory(pythonExec: String, envVars: 
Map[String, String])
+
+private[spark] class PythonWorkerFactory(pythonExec: String,
+ envVars: Map[String, String],
+ conf: SparkConf)
--- End diff --

Maybe as below:

```scala
private[spark] class PythonWorkerFactory(
pythonExec: String,
envVars: Map[String, String],
conf: SparkConf)
```





[GitHub] spark issue #13617: [SPARK-10409] [ML] Add Multilayer Perceptron Regression ...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13617
  
**[Test build #60354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60354/consoleFull)** for PR 13617 at commit [`138fd25`](https://github.com/apache/spark/commit/138fd25775a52c087feac47288c4be6480624085).





[GitHub] spark issue #13619: [SPARK-15892][ML] Incorrectly merged AFTAggregator with ...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13619
  
**[Test build #60353 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60353/consoleFull)** for PR 13619 at commit [`4447d0a`](https://github.com/apache/spark/commit/4447d0a969229dfe5d6cf1bdfc3c0ac62c1fd53e).





[GitHub] spark issue #13619: [SPARK-15892][ML] Incorrectly merged AFTAggregator with ...

2016-06-11 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/13619
  
@jkbradley Sure!





[GitHub] spark issue #8416: [SPARK-10185] [SQL] Feat sql comma separated paths

2016-06-11 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/8416
  
This patch should not have broken reading files whose names include a comma.
I also added a unit test for this:
https://github.com/apache/spark/pull/8416/files#diff-5d2ebf4e9ca5a990136b276859769289R896





[GitHub] spark pull request #13381: [SPARK-15608][ml][examples][doc] add examples and...

2016-06-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13381#discussion_r66715565
  
--- Diff: docs/ml-classification-regression.md ---
@@ -685,6 +685,76 @@ The implementation matches the result from R's survival function
 
 
 
+## Isotonic regression
+[Isotonic regression](http://en.wikipedia.org/wiki/Isotonic_regression)
+belongs to the family of regression algorithms. Formally isotonic regression is a problem where
+given a finite set of real numbers `$Y = {y_1, y_2, ..., y_n}$` representing observed responses
+and `$X = {x_1, x_2, ..., x_n}$` the unknown response values to be fitted
+finding a function that minimises
+
+`\begin{equation}
+  f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
+\end{equation}`
+
+with respect to complete order subject to
+`$x_1\le x_2\le ...\le x_n$` where `$w_i$` are positive weights.
+The resulting function is called isotonic regression and it is unique.
+It can be viewed as least squares problem under order restriction.
+Essentially isotonic regression is a
+[monotonic function](http://en.wikipedia.org/wiki/Monotonic_function)
+best fitting the original data points.
+
+In `spark.ml`, we implement a
--- End diff --

@jkbradley Done.
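For readers of the doc diff above, the quoted objective (minimize the weighted squared error subject to non-decreasing fitted values) can be sketched with the classic pool-adjacent-violators pass. This is an illustrative standalone sketch, not Spark's `IsotonicRegression` implementation; the names `Pava` and `fit` are assumptions.

```scala
// Illustrative pool-adjacent-violators (PAV) sketch: produce non-decreasing
// fitted values minimizing sum_i w_i * (y_i - f_i)^2.
object Pava {
  def fit(y: Array[Double], w: Array[Double]): Array[Double] = {
    // each block holds (weighted mean, total weight, number of points pooled)
    val blocks = scala.collection.mutable.ArrayBuffer.empty[(Double, Double, Int)]
    for (i <- y.indices) {
      blocks += ((y(i), w(i), 1))
      // merge while the last two blocks violate monotonicity
      while (blocks.length > 1 && blocks(blocks.length - 2)._1 > blocks.last._1) {
        val (m2, w2, n2) = blocks.remove(blocks.length - 1)
        val (m1, w1, n1) = blocks.remove(blocks.length - 1)
        blocks += (((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2))
      }
    }
    // expand each pooled block back to one fitted value per input point
    blocks.toArray.flatMap { case (m, _, n) => Array.fill(n)(m) }
  }
}
```

For example, the responses `1.0, 3.0, 2.0` with unit weights pool the last two points into their mean, yielding the non-decreasing fit `1.0, 2.5, 2.5`.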





[GitHub] spark issue #13381: [SPARK-15608][ml][examples][doc] add examples and docume...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13381
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60352/
Test PASSed.





[GitHub] spark issue #13381: [SPARK-15608][ml][examples][doc] add examples and docume...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13381
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13381: [SPARK-15608][ml][examples][doc] add examples and docume...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13381
  
**[Test build #60352 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60352/consoleFull)** for PR 13381 at commit [`083f884`](https://github.com/apache/spark/commit/083f8840ac724333e1ae6600f5d231b787230b81).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread avulanov
Github user avulanov commented on the issue:

https://github.com/apache/spark/pull/13621
  
@mengxr @jkbradley could you take a look?





[GitHub] spark issue #13381: [SPARK-15608][ml][examples][doc] add examples and docume...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13381
  
**[Test build #60352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60352/consoleFull)** for PR 13381 at commit [`083f884`](https://github.com/apache/spark/commit/083f8840ac724333e1ae6600f5d231b787230b81).





[GitHub] spark issue #13381: [SPARK-15608][ml][examples][doc] add examples and docume...

2016-06-11 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/13381
  
Other than that, this looks ready.





[GitHub] spark pull request #13381: [SPARK-15608][ml][examples][doc] add examples and...

2016-06-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/13381#discussion_r66714994
  
--- Diff: docs/ml-classification-regression.md ---
@@ -685,6 +685,76 @@ The implementation matches the result from R's 
survival function
 
 
 
+## Isotonic regression
+[Isotonic regression](http://en.wikipedia.org/wiki/Isotonic_regression)
+belongs to the family of regression algorithms. Formally, isotonic regression is the problem where,
+given a finite set of real numbers `$Y = {y_1, y_2, ..., y_n}$` representing observed responses
+and `$X = {x_1, x_2, ..., x_n}$` the unknown response values to be fitted,
+we find a function that minimises
+
+`\begin{equation}
+  f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
+\end{equation}`
+
+with respect to the complete order, subject to
+`$x_1\le x_2\le ...\le x_n$`, where the `$w_i$` are positive weights.
+The resulting function is called isotonic regression, and it is unique.
+It can be viewed as a least squares problem under an order restriction.
+Essentially, isotonic regression is the
+[monotonic function](http://en.wikipedia.org/wiki/Monotonic_function)
+that best fits the original data points.
+
+In `spark.ml`, we implement a
--- End diff --

I'd avoid using ```spark.ml```.  It was useful when there were 2 active 
APIs, but the naming confuses people sometimes.  I'd just start with "We 
implement..."
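To make the objective above concrete, the standard pool adjacent violators algorithm (PAVA) that solves this constrained least-squares problem can be sketched in plain Python. This is an illustrative sketch only; the function name and block representation are invented for the example, and it is not Spark's implementation.

```python
def isotonic_fit(y, w=None):
    """Pool Adjacent Violators: weighted least-squares monotone fit to y."""
    w = w or [1.0] * len(y)
    blocks = []  # each block: [weighted mean, total weight, point count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge backwards while the order constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            tw = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / tw, tw, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

For example, `isotonic_fit([1, 3, 2, 4])` pools the violating pair (3, 2) into their mean, yielding `[1, 2.5, 2.5, 4]`.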





[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13621
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60351/
Test PASSed.





[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13621
  
**[Test build #60351 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60351/consoleFull)**
 for PR 13621 at commit 
[`b3f5539`](https://github.com/apache/spark/commit/b3f5539b45f86309b17394a9f7ba88d82dcd124f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13621
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13586: [SPARK-15860] Metrics for codegen size and perf

2016-06-11 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/13586
  
Fixed

On Sat, Jun 11, 2016, 7:07 PM UCB AMPLab  wrote:

> Test PASSed.
>
>
> Refer to this link for build results (access rights to CI server needed):
>
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60349/
> Test PASSed.
>






[GitHub] spark issue #13586: [SPARK-15860] Metrics for codegen size and perf

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13586
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60349/
Test PASSed.





[GitHub] spark issue #13586: [SPARK-15860] Metrics for codegen size and perf

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13586
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13586: [SPARK-15860] Metrics for codegen size and perf

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13586
  
**[Test build #60349 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60349/consoleFull)**
 for PR 13586 at commit 
[`4a85cb4`](https://github.com/apache/spark/commit/4a85cb427385548a2bdf939c3e5f486e20b9967b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...

2016-06-11 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/13585#discussion_r66714698
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -92,6 +92,36 @@ object PhysicalOperation extends PredicateHelper {
   .map(Alias(_, a.name)(a.exprId, a.qualifier, isGenerated = 
a.isGenerated)).getOrElse(a)
 }
   }
+
+  /**
+   * Drop the non-partition key expression in the disjunctions, to 
optimize the partition pruning.
--- End diff --

Oh, OK. Originally I thought the conjunction cases were handled in 
`collectProjectsAndFilters` already, before being passed into this function, 
and that here we only handle the `AND` inside the disjunction. (You can see this in 
HiveTableScans in HiveStrategies.scala)

Anyway, you convinced me. :)





[GitHub] spark issue #13545: [SPARK-15807][SQL] Support varargs for dropDuplicates in...

2016-06-11 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/13545
  
Thank you for merging, @rxin !





[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13621
  
**[Test build #60351 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60351/consoleFull)**
 for PR 13621 at commit 
[`b3f5539`](https://github.com/apache/spark/commit/b3f5539b45f86309b17394a9f7ba88d82dcd124f).





[GitHub] spark issue #13524: [SPARK-15776][SQL] Type coercion incorrect

2016-06-11 Thread watermen
Github user watermen commented on the issue:

https://github.com/apache/spark/pull/13524
  
@davies @rxin Can you review the code for this?





[GitHub] spark issue #13561: [SPARK-15824][SQL] Run 'with ... insert ... select' fail...

2016-06-11 Thread watermen
Github user watermen commented on the issue:

https://github.com/apache/spark/pull/13561
  
@davies @rxin Can you review the code for this?





[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...

2016-06-11 Thread yangw1234
Github user yangw1234 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13585#discussion_r66714468
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -92,6 +92,36 @@ object PhysicalOperation extends PredicateHelper {
   .map(Alias(_, a.name)(a.exprId, a.qualifier, isGenerated = 
a.isGenerated)).getOrElse(a)
 }
   }
+
+  /**
+   * Drop the non-partition key expression in the disjunctions, to 
optimize the partition pruning.
--- End diff --

It is `(part=1 conjunction a=1) disjunction (part=2 conjunction a=4)`, 
right? But the expressions that get dropped are `a=1`, which is in "conjunction" with 
`part=1`, and `a=4`, which is in "conjunction" with `part=2`. So I thought it 
should be conjunctions.

Or maybe we can phrase it another way to avoid the confusion? ^_^







[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13621
  
**[Test build #60350 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60350/consoleFull)**
 for PR 13621 at commit 
[`adc81ba`](https://github.com/apache/spark/commit/adc81ba1f1b6fb014bb1813de3ab283f841585d5).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class StackedAutoencoder (override val uid: String)`





[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13621
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60350/
Test FAILed.





[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13621
  
Merged build finished. Test FAILed.





[GitHub] spark issue #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13621
  
**[Test build #60350 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60350/consoleFull)**
 for PR 13621 at commit 
[`adc81ba`](https://github.com/apache/spark/commit/adc81ba1f1b6fb014bb1813de3ab283f841585d5).





[GitHub] spark pull request #13621: [SPARK-2623] [ML] Implement stacked autoencoder

2016-06-11 Thread avulanov
GitHub user avulanov opened a pull request:

https://github.com/apache/spark/pull/13621

[SPARK-2623] [ML] Implement stacked autoencoder

## What changes were proposed in this pull request?
Implement stacked autoencoder
- Based on ml.ann Layer and LossFunction
- Implement two loss functions `EmptyLayerWithSquaredError` and 
`SigmoidLayerWithSquaredError` to handle inputs (-inf, +inf) and [0, 1]
- Implement greedy training
- Provide encoder and decoder

## How was this patch tested?
Provide unit tests
- Gradient correctness of the new LossFunctions
- Correct reconstruction of the original data by encoding and decoding 
(based on Berkeley's CS182)
- Successful pre-training of deep network with 6 hidden layers

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/avulanov/spark autoencoder-mlp

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13621.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13621


commit adc81ba1f1b6fb014bb1813de3ab283f841585d5
Author: avulanov 
Date:   2016-04-04T23:06:25Z

Implement stacked autoencoder
- Based on ml.ann Layer and LossFunction
- Implement two new loss functions EmptyLayerWithSquaredError and 
SigmoidLayerWithSquaredError to handle inputs [-inf, +inf] and [0, 1]
- Implement greedy training
- Provide encoder and decoder







[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...

2016-06-11 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/13585#discussion_r66714358
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -65,15 +65,20 @@ private[hive] trait HiveStrategies {
 // hive table scan operator to be used for partition pruning.
 val partitionKeyIds = AttributeSet(relation.partitionKeys)
 val (pruningPredicates, otherPredicates) = predicates.partition { 
predicate =>
-  !predicate.references.isEmpty &&
+  predicate.references.nonEmpty &&
   predicate.references.subsetOf(partitionKeyIds)
 }
+val additionalPartPredicates =
+  PhysicalOperation.partitionPrunningFromDisjunction(
+otherPredicates.foldLeft[Expression](Literal(true))(And(_, 
_)), partitionKeyIds)
 
 pruneFilterProject(
   projectList,
   otherPredicates,
   identity[Seq[Expression]],
-  HiveTableScanExec(_, relation, pruningPredicates)(sparkSession)) 
:: Nil
+HiveTableScanExec(_,
+relation,
+pruningPredicates ++ additionalPartPredicates)(sparkSession)) 
:: Nil
--- End diff --

For `HiveTableScan`, the predicate here is just to minimize the partition 
scanning, so what we need to do is put a more specific partition pruning 
predicate.

Sorry if anything was confusing.






[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...

2016-06-11 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/13585#discussion_r66714324
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -92,6 +92,36 @@ object PhysicalOperation extends PredicateHelper {
   .map(Alias(_, a.name)(a.exprId, a.qualifier, isGenerated = 
a.isGenerated)).getOrElse(a)
 }
   }
+
+  /**
+   * Drop the non-partition key expression in the disjunctions, to 
optimize the partition pruning.
--- End diff --

I think it should be `disjunction`. For example:

`(part=1 and a=1) or (part = 2 and a=4)` is a disjunction, right?





[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...

2016-06-11 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/13585#discussion_r66714314
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala ---
@@ -65,4 +69,95 @@ class QueryPartitionSuite extends QueryTest with 
SQLTestUtils with TestHiveSingl
   sql("DROP TABLE IF EXISTS createAndInsertTest")
 }
   }
+
+  test("partition pruning in disjunction") {
+withSQLConf((SQLConf.HIVE_VERIFY_PARTITION_PATH.key, "true")) {
+  val testData = sparkContext.parallelize(
+(1 to 10).map(i => TestData(i, i.toString))).toDF()
+  testData.registerTempTable("testData")
+
+  val testData2 = sparkContext.parallelize(
+(11 to 20).map(i => TestData(i, i.toString))).toDF()
+  testData2.registerTempTable("testData2")
+
+  val testData3 = sparkContext.parallelize(
+(21 to 30).map(i => TestData(i, i.toString))).toDF()
+  testData3.registerTempTable("testData3")
+
+  val testData4 = sparkContext.parallelize(
+(31 to 40).map(i => TestData(i, i.toString))).toDF()
+  testData4.registerTempTable("testData4")
+
+  val tmpDir = Files.createTempDir()
+  // create the table for test
+  sql(s"CREATE TABLE table_with_partition(key int,value string) " +
+s"PARTITIONED by (ds string, ds2 string) location 
'${tmpDir.toURI.toString}' ")
+  sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='1', 
ds2='d1') " +
+"SELECT key,value FROM testData")
+  sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='2', 
ds2='d1') " +
+"SELECT key,value FROM testData2")
+  sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='3', 
ds2='d3') " +
+"SELECT key,value FROM testData3")
+  sql("INSERT OVERWRITE TABLE table_with_partition  partition (ds='4', 
ds2='d4') " +
+"SELECT key,value FROM testData4")
+
+  checkAnswer(sql("select key,value from table_with_partition"),
+testData.collect ++ testData2.collect ++ testData3.collect ++ 
testData4.collect)
+
+  checkAnswer(
+sql(
+  """select key,value from table_with_partition
+| where (ds='4' and key=38) or (ds='3' and 
key=22)""".stripMargin),
+  Row(38, "38") :: Row(22, "22") :: Nil)
+
+  checkAnswer(
+sql(
+  """select key,value from table_with_partition
+| where (key<40 and key>38) or (ds='3' and 
key=22)""".stripMargin),
+Row(39, "39") :: Row(22, "22") :: Nil)
+
+  sql("DROP TABLE table_with_partition")
+  sql("DROP TABLE createAndInsertTest")
--- End diff --

good catch. :)





[GitHub] spark issue #13585: [SPARK-15859][SQL] Optimize the partition pruning within...

2016-06-11 Thread chenghao-intel
Github user chenghao-intel commented on the issue:

https://github.com/apache/spark/pull/13585
  
Thank you all for the review, but I am not going to solve the CNF conversion 
here; the intention of this PR is to extract more partition pruning expressions, 
so we will have fewer partitions to scan during the table scan.

But I did find some bugs in this PR, and will add more unit tests soon.





[GitHub] spark pull request #13585: [SPARK-15859][SQL] Optimize the partition pruning...

2016-06-11 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/13585#discussion_r66714297
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala
 ---
@@ -92,6 +92,36 @@ object PhysicalOperation extends PredicateHelper {
   .map(Alias(_, a.name)(a.exprId, a.qualifier, isGenerated = 
a.isGenerated)).getOrElse(a)
 }
   }
+
+  /**
+   * Drop the non-partition key expression in the disjunctions, to 
optimize the partition pruning.
+   * For instance (we assume part1 & part2 are the partition keys):
+   * (part1 == 1 and a > 3) or (part2 == 2 and a < 5)  ==> (part1 == 1 or part2 == 2)
+   * (part1 == 1 and a > 3) or (a < 100)  ==> None
+   * (a > 100 and b < 100) or (part1 == 10)  ==> None
+   * (a > 100 and b < 100 and part1 == 10) or (part1 == 2)  ==> (part1 == 10 or part1 == 2)
+   * @param predicate disjunctions
+   * @param partitionKeyIds partition keys in attribute set
+   * @return
+   */
+  def partitionPrunningFromDisjunction(
+predicate: Expression, partitionKeyIds: AttributeSet): 
Option[Expression] = {
+// ignore the pure non-partition key expression in conjunction of the 
expression tree
+val additionalPartPredicate = predicate transformUp {
+  case a @ And(left, right) if a.deterministic &&
+left.references.intersect(partitionKeyIds).isEmpty => right
+  case a @ And(left, right) if a.deterministic &&
+right.references.intersect(partitionKeyIds).isEmpty => left
--- End diff --

Actually, the output of `!(partition = 1 && a > 3)` should be 
`!(partition=1)`; what should be dropped here is the `a>3`.
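The rule being debated, dropping non-partition-key conjuncts while requiring every branch of a disjunction to constrain a partition key, can be illustrated on a toy expression tree. This is a hypothetical Python sketch of the idea only; the tuple representation and function names are invented and do not mirror Catalyst's `Expression` API.

```python
# Expressions as nested tuples: ("and"|"or", left, right) or ("pred", {referenced columns}).
def refs(e):
    """Set of column names referenced by an expression."""
    return e[1] if e[0] == "pred" else refs(e[1]) | refs(e[2])

def prune(e, part_keys):
    """Extract a predicate over partition keys only, or None if impossible."""
    op = e[0]
    if op == "and":
        l, r = prune(e[1], part_keys), prune(e[2], part_keys)
        # a conjunct over non-partition columns can simply be dropped
        return ("and", l, r) if l and r else (l or r)
    if op == "or":
        l, r = prune(e[1], part_keys), prune(e[2], part_keys)
        # a disjunction prunes only if BOTH branches constrain partition keys
        return ("or", l, r) if l and r else None
    return e if refs(e) <= part_keys else None

# (part1=1 and a>3) or (part2=2 and a<5)  ==>  part1=1 or part2=2
expr = ("or",
        ("and", ("pred", {"part1"}), ("pred", {"a"})),
        ("and", ("pred", {"part2"}), ("pred", {"a"})))
```

Here `prune(expr, {"part1", "part2"})` yields a disjunction over the two partition predicates, while a disjunction with one branch touching only non-partition columns yields `None`, matching the examples in the doc comment under review.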





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/12836
  
We can do it in a separate pr -- it'd be great to move all Python and R 
methods over to a single class. Otherwise it has two major problems:

1. Those methods are public for Java.

2. It is very difficult to refactor (because IDEs don't know the caller in 
non-scala/java languages)





[GitHub] spark issue #13602: [SPARK-15878][CORE][TEST] fix cleanup in EventLoggingLis...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13602
  
**[Test build #3080 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3080/consoleFull)**
 for PR 13602 at commit 
[`90f2ac1`](https://github.com/apache/spark/commit/90f2ac144186c78c6516987b0cb82ec2c232ff34).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class ReplayListenerSuite extends SparkFunSuite with BeforeAndAfter 
with LocalSparkContext `





[GitHub] spark issue #12983: [SPARK-15213][PySpark] Unify 'range' usages

2016-06-11 Thread asfer
Github user asfer commented on the issue:

https://github.com/apache/spark/pull/12983
  
Apart from the merge conflict, everything looks good to me.





[GitHub] spark pull request #13338: [SPARK-13723] [YARN] Change behavior of --num-exe...

2016-06-11 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/13338#discussion_r66714041
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2262,21 +2262,39 @@ private[spark] object Utils extends Logging {
   }
 
   /**
-   * Return whether dynamic allocation is enabled in the given conf
-   * Dynamic allocation and explicitly setting the number of executors are inherently
-   * incompatible. In environments where dynamic allocation is turned on by default,
-   * the latter should override the former (SPARK-9092).
+   * Return whether dynamic allocation is enabled in the given conf.
    */
   def isDynamicAllocationEnabled(conf: SparkConf): Boolean = {
-    val numExecutor = conf.getInt("spark.executor.instances", 0)
     val dynamicAllocationEnabled = conf.getBoolean("spark.dynamicAllocation.enabled", false)
-    if (numExecutor != 0 && dynamicAllocationEnabled) {
-      logWarning("Dynamic Allocation and num executors both set, thus dynamic allocation disabled.")
-    }
-    numExecutor == 0 && dynamicAllocationEnabled &&
+    dynamicAllocationEnabled &&
       (!isLocalMaster(conf) || conf.getBoolean("spark.dynamicAllocation.testing", false))
   }
 
+  /**
+   * Return the minimum number of executors for dynamic allocation.
+   */
+  def getDynamicAllocationMinExecutors(conf: SparkConf): Int = {
+    conf.getInt("spark.dynamicAllocation.minExecutors", 0)
--- End diff --

I see, so those aren't Strings. Thanks, I'll fix it.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12836
  
Merged build finished. Test PASSed.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12836
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60348/
Test PASSed.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12836
  
**[Test build #60348 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60348/consoleFull)** for PR 12836 at commit [`d51441f`](https://github.com/apache/spark/commit/d51441f704e2abad7f7a3cc829664cd201b0fcd2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-11 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/12836
  
@rxin I think in this case we need access to the grouping expression and DataFrame from within the RelationalGroupedDataset class. One solution could be to move the function `flatMapGroupsInR` to the helper object, which is already `private[sql]`.





[GitHub] spark issue #13586: [SPARK-15860] Metrics for codegen size and perf

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13586
  
**[Test build #60349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60349/consoleFull)** for PR 13586 at commit [`4a85cb4`](https://github.com/apache/spark/commit/4a85cb427385548a2bdf939c3e5f486e20b9967b).





[GitHub] spark pull request #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on ...

2016-06-11 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/12836#discussion_r66713462
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -381,6 +385,50 @@ class RelationalGroupedDataset protected[sql](
   def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = {
     pivot(pivotColumn, values.asScala)
   }
+
+  /**
+   * Applies the given serialized R function `func` to each group of data. For each unique group,
+   * the function will be passed the group key and an iterator that contains all of the elements in
+   * the group. The function can return an iterator containing elements of an arbitrary type which
+   * will be returned as a new [[DataFrame]].
+   *
+   * This function does not support partial aggregation, and as a result requires shuffling all
+   * the data in the [[Dataset]]. If an application intends to perform an aggregation over each
+   * key, it is best to use the reduce function or an
+   * [[org.apache.spark.sql.expressions#Aggregator Aggregator]].
+   *
+   * Internally, the implementation will spill to disk if any given group is too large to fit into
+   * memory. However, users must take care to avoid materializing the whole iterator for a group
+   * (for example, by calling `toList`) unless they are sure that this is possible given the memory
+   * constraints of their cluster.
+   *
+   * @since 2.0.0
+   */
+  private[sql] def flatMapGroupsInR(
+      f: Array[Byte],
+      packageNames: Array[Byte],
+      broadcastVars: Array[Object],
+      outputSchema: StructType): DataFrame = {
+    val broadcastVarObj = broadcastVars.map(_.asInstanceOf[Broadcast[Object]])
+    val groupingNamedExpressions = groupingExprs.map(alias)
+    val groupingCols = groupingNamedExpressions.map(Column(_))
+    val groupingDataFrame = df.select(groupingCols : _*)
+    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
+    val realOutputSchema = if (outputSchema == null) SERIALIZED_R_DATA_SCHEMA else outputSchema
--- End diff --

If the schema should not be null, can we assert that on the R side and always pass in a non-null value? I think for `dapply` we wanted to support `collect` on the result of the UDF, which could work even without a schema.

The other nice way to handle this would be to construct the binary schema that we fall back on from the R side and pass that in (i.e. keeping all input validation in R and only logic in Scala).
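The trade-off under discussion is where the null fallback should live. As a neutral illustration only (names and types are hypothetical stand-ins, not Spark's actual API), here are the two shapes side by side in Java:

```java
import java.util.Optional;

public class SchemaFallback {
    // Hypothetical stand-in for the serialized-R binary schema default.
    static final String SERIALIZED_R_DATA_SCHEMA = "binary-schema";

    // Callee-side variant: the Scala method silently falls back when the
    // caller passes null (the shape in the diff above).
    static String resolveInCallee(String outputSchema) {
        return Optional.ofNullable(outputSchema).orElse(SERIALIZED_R_DATA_SCHEMA);
    }

    // Caller-side variant: the R side validates and always passes a
    // non-null schema, so the callee can simply require it.
    static String resolveInCaller(String outputSchema) {
        if (outputSchema == null) {
            throw new IllegalArgumentException("schema must be resolved by the caller");
        }
        return outputSchema;
    }

    public static void main(String[] args) {
        System.out.println(resolveInCallee(null));       // falls back to the default
        System.out.println(resolveInCallee("explicit"));
        System.out.println(resolveInCaller("explicit"));
    }
}
```

The second variant keeps all input validation on one side of the boundary, which is the point being made about keeping validation in R.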





[GitHub] spark pull request #13482: [SPARK-15725][YARN] Ensure ApplicationMaster slee...

2016-06-11 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/13482#discussion_r66713200
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -462,10 +464,23 @@ private[spark] class ApplicationMaster(
             nextAllocationInterval = initialAllocationInterval
             heartbeatInterval
           }
-          logDebug(s"Number of pending allocations is $numPendingAllocate. " +
-            s"Sleeping for $sleepInterval.")
+          sleepStart = System.currentTimeMillis()
           allocatorLock.wait(sleepInterval)
         }
+        val sleepDuration = System.currentTimeMillis() - sleepStart
+        if (sleepDuration < sleepInterval - 5) {
+          // log when sleep is interrupted
+          logInfo(s"Number of pending allocations is $numPendingAllocate. " +
--- End diff --

These messages are the only signal we have that the allocation loop is being signalled too often. I think it's worth an info message so we can identify other cases that cause this behavior. The normal case, where the thread already slept for more than the min interval, is logged at debug. This doesn't add an unreasonable number of log messages.





[GitHub] spark issue #13482: [SPARK-15725][YARN] Ensure ApplicationMaster sleeps for ...

2016-06-11 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/13482
  
@andrewor14, I think we should consider two problems here: first, that the thread will sleep for less than the min interval if something signals it; and second, whatever is currently signalling it. We should certainly fix the loss-reason request that is currently triggering this behavior, but I still think this patch is a good solution to the first problem, in case there are other situations that cause it as well.

There's not a good reason to sleep for less than the min interval if it can 
cause the application to become unstable. We could look at a more complicated 
strategy -- like an exponentially increasing min interval up to the current min 
-- but the important thing right now is to ensure nothing can cause this 
instability.

To be clear, I don't consider this a complete fix for both of those 
problems. We should definitely avoid the `askWithRetry`, only signal the 
allocator thread when necessary, etc. But as a safety precaution, I think this 
patch is a good start.
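The "exponentially increasing min interval" is only floated as an option above. Purely as an illustration of that idea (all numbers hypothetical, not Spark's configuration), the enforced minimum could grow on each early wake-up until it reaches the configured interval:

```java
public class BackoffSketch {
    public static void main(String[] args) {
        long initial = 200; // hypothetical initial minimum, in ms
        long cap = 3000;    // hypothetical configured min interval, in ms
        long min = initial;

        // Each time the allocator thread is woken early, the minimum it
        // must still sleep doubles, capped at the configured interval.
        StringBuilder schedule = new StringBuilder();
        while (min < cap) {
            schedule.append(min).append(' ');
            min = Math.min(min * 2, cap);
        }
        schedule.append(min);
        System.out.println(schedule); // 200 400 800 1600 3000
    }
}
```

This would let a few early wake-ups through cheaply while still preventing a flood of signals from spinning the allocation loop.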





[GitHub] spark pull request #13482: [SPARK-15725][YARN] Ensure ApplicationMaster slee...

2016-06-11 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/13482#discussion_r66713080
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -462,10 +464,23 @@ private[spark] class ApplicationMaster(
             nextAllocationInterval = initialAllocationInterval
             heartbeatInterval
           }
-          logDebug(s"Number of pending allocations is $numPendingAllocate. " +
-            s"Sleeping for $sleepInterval.")
+          sleepStart = System.currentTimeMillis()
           allocatorLock.wait(sleepInterval)
         }
+        val sleepDuration = System.currentTimeMillis() - sleepStart
+        if (sleepDuration < sleepInterval - 5) {
--- End diff --

If the remaining sleep would be less than 5 ms, I figured that was close enough to skip going back to sleep and avoid a context switch. I'm fine without it, but I think it's reasonable to have a minimum time to go back to sleep for.

The problem with using `allocatorLock.wait` is that the thread can be 
signalled and interrupt sleep. That's a good idea if the `sleepInterval` is 
3000 ms, but if we want to ensure a minimum amount of time, then the second 
sleep should not be interrupted by signalling the `allocatorLock`, which is why 
`Thread.sleep` is used instead.
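The distinction here is standard JVM monitor behavior: `Object.wait(timeout)` returns early when the monitor is notified, while `Thread.sleep` runs for its full duration regardless of `notify`. A minimal self-contained sketch (timing values are made up for the demo, not Spark's actual config):

```java
public class MinSleepDemo {
    public static void main(String[] args) throws InterruptedException {
        final Object allocatorLock = new Object();

        // Background thread notifies the lock after ~100 ms, the way the
        // allocator thread gets signalled when new work arrives.
        Thread signaller = new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException e) { return; }
            synchronized (allocatorLock) { allocatorLock.notifyAll(); }
        });
        signaller.start();

        long start = System.currentTimeMillis();
        synchronized (allocatorLock) {
            allocatorLock.wait(5000); // cut short by notifyAll after ~100 ms
        }
        long waited = System.currentTimeMillis() - start;
        System.out.println("waited early: " + (waited < 4000));

        // Thread.sleep enforces a minimum: notifyAll cannot end it early,
        // only Thread.interrupt can. Allow 10 ms of clock granularity.
        long minRemaining = 200;
        start = System.currentTimeMillis();
        Thread.sleep(minRemaining);
        long slept = System.currentTimeMillis() - start;
        System.out.println("slept full minimum: " + (slept >= minRemaining - 10));

        signaller.join();
    }
}
```

This is why, as argued above, the second sleep that enforces the minimum interval uses `Thread.sleep` rather than another `allocatorLock.wait`.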





[GitHub] spark issue #13602: [SPARK-15878][CORE][TEST] fix cleanup in EventLoggingLis...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13602
  
**[Test build #3080 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3080/consoleFull)** for PR 13602 at commit [`90f2ac1`](https://github.com/apache/spark/commit/90f2ac144186c78c6516987b0cb82ec2c232ff34).





[GitHub] spark issue #8416: [SPARK-10185] [SQL] Feat sql comma separated paths

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/8416
  
@marmbrus mind sharing how to do it?






[GitHub] spark issue #8416: [SPARK-10185] [SQL] Feat sql comma separated paths

2016-06-11 Thread marmbrus
Github user marmbrus commented on the issue:

https://github.com/apache/spark/pull/8416
  
@rxin I believe I fixed that limitation in my recent refactoring.





[GitHub] spark issue #13161: [SPARK-14851] [Core] Support radix sort with nullable lo...

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13161
  
FYI I accidentally merged this in branch-2.0 too, but I reverted it.






[GitHub] spark issue #13586: [SPARK-15860] Metrics for codegen size and perf

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13586
  
Looks like this failed some tests.






[GitHub] spark pull request #13545: [SPARK-15807][SQL] Support varargs for dropDuplic...

2016-06-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13545





[GitHub] spark issue #13545: [SPARK-15807][SQL] Support varargs for dropDuplicates in...

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13545
  
Merging in master/2.0.






[GitHub] spark pull request #13161: [SPARK-14851] [Core] Support radix sort with null...

2016-06-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13161





[GitHub] spark issue #13161: [SPARK-14851] [Core] Support radix sort with nullable lo...

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13161
  
Merging in master.






[GitHub] spark issue #13595: [MINOR][SQL] Standardize 'continuous queries' to 'stream...

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13595
  
@tdas this matches what we discussed right?






[GitHub] spark issue #8416: [SPARK-10185] [SQL] Feat sql comma separated paths

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/8416
  
Is there still a way to specify a file whose name includes a comma after this patch?





[GitHub] spark pull request #13605: [SPARK-15856][SQL] Revert API breaking changes ma...

2016-06-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13605





[GitHub] spark issue #13604: [SPARK-15856][SQL] Revert API breaking changes made in D...

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13604
  
Can you update the pr and link to this ticket? 
https://issues.apache.org/jira/browse/SPARK-15898

Thanks.






[GitHub] spark pull request #13607: [SPARK-15881] Update microbenchmark results for W...

2016-06-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13607





[GitHub] spark issue #13605: [SPARK-15856][SQL] Revert API breaking changes made in S...

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13605
  
Merging in master/2.0.






[GitHub] spark issue #13607: [SPARK-15881] Update microbenchmark results for WideSche...

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13607
  
Merging in master/2.0.






[GitHub] spark pull request #13607: [SPARK-15881] Update microbenchmark results for W...

2016-06-11 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13607#discussion_r66712617
  
--- Diff: project/SparkBuild.scala ---
@@ -833,7 +833,7 @@ object TestSettings {
     javaOptions in Test += "-Dspark.ui.enabled=false",
     javaOptions in Test += "-Dspark.ui.showConsoleProgress=false",
     javaOptions in Test += "-Dspark.unsafe.exceptionOnMemoryLeak=true",
-    javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=true",
+    javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=false",
--- End diff --

I guess we can turn `spark.serializer.extraDebugInfo` on.





[GitHub] spark pull request #13607: [SPARK-15881] Update microbenchmark results for W...

2016-06-11 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13607#discussion_r66712619
  
--- Diff: project/SparkBuild.scala ---
@@ -833,7 +833,7 @@ object TestSettings {
     javaOptions in Test += "-Dspark.ui.enabled=false",
     javaOptions in Test += "-Dspark.ui.showConsoleProgress=false",
     javaOptions in Test += "-Dspark.unsafe.exceptionOnMemoryLeak=true",
-    javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=true",
+    javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=false",
--- End diff --

it's on by default - ok this pr lgtm





[GitHub] spark pull request #13607: [SPARK-15881] Update microbenchmark results for W...

2016-06-11 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13607#discussion_r66712579
  
--- Diff: project/SparkBuild.scala ---
@@ -833,7 +833,7 @@ object TestSettings {
     javaOptions in Test += "-Dspark.ui.enabled=false",
     javaOptions in Test += "-Dspark.ui.showConsoleProgress=false",
     javaOptions in Test += "-Dspark.unsafe.exceptionOnMemoryLeak=true",
-    javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=true",
+    javaOptions in Test += "-Dsun.io.serialization.extendedDebugInfo=false",
--- End diff --

Hm, this has been very useful for debugging when we encounter non-serializable exceptions.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12836
  
**[Test build #60348 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60348/consoleFull)** for PR 12836 at commit [`d51441f`](https://github.com/apache/spark/commit/d51441f704e2abad7f7a3cc829664cd201b0fcd2).





[GitHub] spark pull request #13616: [SPARK-15585][SQL] Add doc for turning off quotat...

2016-06-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13616





[GitHub] spark issue #13616: [SPARK-15585][SQL] Add doc for turning off quotations

2016-06-11 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13616
  
Merging in master/2.0. Thanks.






[GitHub] spark issue #13552: [SPARK-15813] Improve Canceling log message to make it l...

2016-06-11 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/13552
  
The line is too long, @peterableda, but otherwise it looks fine.





[GitHub] spark issue #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on DataFra...

2016-06-11 Thread NarineK
Github user NarineK commented on the issue:

https://github.com/apache/spark/pull/12836
  
Thanks @liancheng and @rxin!
With respect to your point, @rxin - "private[sql] signature in public APIs."

dapply added that signature to `Dataset.scala` and gapply adds it to `RelationalGroupedDataset.scala`.
We can think of pulling those out into a helper method, but maybe we can do that in a separate JIRA?
cc: @shivaram, @sun-rui





[GitHub] spark pull request #12836: [SPARK-12922][SparkR][WIP] Implement gapply() on ...

2016-06-11 Thread NarineK
Github user NarineK commented on a diff in the pull request:

https://github.com/apache/spark/pull/12836#discussion_r66712035
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -381,6 +385,50 @@ class RelationalGroupedDataset protected[sql](
   def pivot(pivotColumn: String, values: java.util.List[Any]): RelationalGroupedDataset = {
     pivot(pivotColumn, values.asScala)
   }
+
+  /**
+   * Applies the given serialized R function `func` to each group of data. For each unique group,
+   * the function will be passed the group key and an iterator that contains all of the elements in
+   * the group. The function can return an iterator containing elements of an arbitrary type which
+   * will be returned as a new [[DataFrame]].
+   *
+   * This function does not support partial aggregation, and as a result requires shuffling all
+   * the data in the [[Dataset]]. If an application intends to perform an aggregation over each
+   * key, it is best to use the reduce function or an
+   * [[org.apache.spark.sql.expressions#Aggregator Aggregator]].
+   *
+   * Internally, the implementation will spill to disk if any given group is too large to fit into
+   * memory. However, users must take care to avoid materializing the whole iterator for a group
+   * (for example, by calling `toList`) unless they are sure that this is possible given the memory
+   * constraints of their cluster.
+   *
+   * @since 2.0.0
+   */
+  private[sql] def flatMapGroupsInR(
+      f: Array[Byte],
+      packageNames: Array[Byte],
+      broadcastVars: Array[Object],
+      outputSchema: StructType): DataFrame = {
+    val broadcastVarObj = broadcastVars.map(_.asInstanceOf[Broadcast[Object]])
+    val groupingNamedExpressions = groupingExprs.map(alias)
+    val groupingCols = groupingNamedExpressions.map(Column(_))
+    val groupingDataFrame = df.select(groupingCols : _*)
+    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
+    val realOutputSchema = if (outputSchema == null) SERIALIZED_R_DATA_SCHEMA else outputSchema
--- End diff --

@liancheng , thank you for the review comments.

Those are good suggestions, however for:

Case 1: using Option[StructType] ... - I gave it a try, but since this method is called from the R side we would need to somehow instantiate a "scala.Option" instance, which does not seem straightforward to do from R. From the R side we would basically call the following method:
`org.apache.spark.sql.Dataset flatMapGroupsInR(byte[] f, byte[] packageNames, java.lang.Object[] broadcastVars, scala.Option outputSchema)`

Case 2: Similar to dapply, gapply forces the schema through the method signature, so a default value doesn't really work here.

But I can make the changes if it is preferred.
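For readers following the Option-vs-null discussion above: the usual JVM-side workaround is to keep a nullable parameter at the entry point and wrap it with the `Option(...)` factory inside Scala, so the R caller never has to construct a `scala.Option`. A minimal, Spark-free sketch of that pattern (the `OptionShim` object and its names are hypothetical, for illustration only):

```scala
// Hypothetical sketch: Option(x) maps null to None and anything else to Some(x),
// so a JVM entry point can accept null from a caller (e.g. from R) and convert
// it internally, instead of requiring the caller to build a scala.Option.
object OptionShim {
  def toOption[A](a: A): Option[A] = Option(a)

  def main(args: Array[String]): Unit = {
    println(toOption(null))     // None
    println(toOption("schema")) // Some(schema)
  }
}
```

This keeps the externally-callable signature primitive-friendly while the internal Scala code still works with `Option[StructType]`.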






[GitHub] spark issue #13615: Fix the import typo in Python example

2016-06-11 Thread sjjpo2002
Github user sjjpo2002 commented on the issue:

https://github.com/apache/spark/pull/13615
  
I'm closing this. I'm not sure how this caused build errors. I just referred to the typo in the documentation and didn't make any changes!





[GitHub] spark pull request #13615: Fix the import typo in Python example

2016-06-11 Thread sjjpo2002
Github user sjjpo2002 closed the pull request at:

https://github.com/apache/spark/pull/13615





[GitHub] spark issue #13617: [SPARK-10409] [ML] Add Multilayer Perceptron Regression ...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13617
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60347/
Test PASSed.





[GitHub] spark issue #13617: [SPARK-10409] [ML] Add Multilayer Perceptron Regression ...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13617
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13617: [SPARK-10409] [ML] Add Multilayer Perceptron Regression ...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13617
  
**[Test build #60347 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60347/consoleFull)** for PR 13617 at commit [`46783ac`](https://github.com/apache/spark/commit/46783acdb5de62530f1cfdc9c69a54f969d42d7e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13617: [SPARK-10409] [ML] Add Multilayer Perceptron Regression ...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13617
  
**[Test build #60347 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60347/consoleFull)** for PR 13617 at commit [`46783ac`](https://github.com/apache/spark/commit/46783acdb5de62530f1cfdc9c69a54f969d42d7e).





[GitHub] spark issue #13617: [SPARK-10409] [ML] Add Multilayer Perceptron Regression ...

2016-06-11 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/13617
  
jenkins add to whitelist





[GitHub] spark issue #13611: [SPARK-15887][SQL] Bring back the hive-site.xml support ...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13611
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60346/
Test FAILed.





[GitHub] spark issue #13611: [SPARK-15887][SQL] Bring back the hive-site.xml support ...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13611
  
**[Test build #60346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60346/consoleFull)** for PR 13611 at commit [`3c6b9fd`](https://github.com/apache/spark/commit/3c6b9fd624faf22659f8d06950f756ece593422d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13611: [SPARK-15887][SQL] Bring back the hive-site.xml support ...

2016-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13611
  
Merged build finished. Test FAILed.





[GitHub] spark issue #13611: [SPARK-15887][SQL] Bring back the hive-site.xml support ...

2016-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13611
  
**[Test build #60346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60346/consoleFull)** for PR 13611 at commit [`3c6b9fd`](https://github.com/apache/spark/commit/3c6b9fd624faf22659f8d06950f756ece593422d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13611: [SPARK-15887][SQL] Bring back the hive-site.xml s...

2016-06-11 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13611#discussion_r66710277
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala ---
@@ -40,10 +42,16 @@ private[sql] class SharedState(val sparkContext: SparkContext) {
    */
   val listener: SQLListener = createListenerAndUI(sparkContext)
 
+  lazy val hadoopConf: Configuration = {
+    val conf = sparkContext.hadoopConfiguration
+    conf.addResource(Utils.getContextOrSparkClassLoader.getResource("hive-site.xml"))
--- End diff --

is it better than `conf.addResouorce("hive-site.xml")`? Which corner case 
do we worry about it? cc @yhuai 
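For context on the question above, the difference between the two forms is where the resource name gets resolved: a bare string name defers the lookup to whatever classloader Hadoop's `Configuration` uses internally, while resolving the `URL` eagerly pins the lookup to the application's own context classloader. A minimal, Hadoop-free sketch of the eager lookup (plain JVM classloader semantics only; no Spark or Hadoop APIs, and `ResourceLookup` is a hypothetical name):

```scala
// Sketch: eagerly resolving a classpath resource to a URL using our own
// context classloader. If hive-site.xml is not on this classloader's
// classpath, getResource returns null; with a bare string name, the same
// miss would only surface later, inside whichever classloader resolves it.
object ResourceLookup {
  def lookup(name: String): Option[java.net.URL] =
    Option(Thread.currentThread.getContextClassLoader.getResource(name))

  def main(args: Array[String]): Unit = {
    // None here unless hive-site.xml happens to be on the classpath.
    println(lookup("hive-site.xml"))
  }
}
```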




