[GitHub] spark pull request #16373: [SPARK-18961][SQL] Support `SHOW TABLE EXTENDED ....

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16373#discussion_r94004059
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -619,18 +621,34 @@ case class ShowTablesCommand(
     // instead of calling tables in sparkSession.
     val catalog = sparkSession.sessionState.catalog
     val db = databaseName.getOrElse(catalog.getCurrentDatabase)
-    val tables =
-      tableIdentifierPattern.map(catalog.listTables(db, _)).getOrElse(catalog.listTables(db))
-    tables.map { tableIdent =>
-      val database = tableIdent.database.getOrElse("")
-      val tableName = tableIdent.table
-      val isTemp = catalog.isTemporaryTable(tableIdent)
-      if (isExtended) {
-        val information = catalog.getTempViewOrPermanentTableMetadata(tableIdent).toString
-        Row(database, tableName, isTemp, s"${information}\n")
-      } else {
-        Row(database, tableName, isTemp)
+    if (partitionSpec.isEmpty) {
+      // Show the information of tables.
+      val tables =
+        tableIdentifierPattern.map(catalog.listTables(db, _)).getOrElse(catalog.listTables(db))
+      tables.map { tableIdent =>
+        val database = tableIdent.database.getOrElse("")
+        val tableName = tableIdent.table
+        val isTemp = catalog.isTemporaryTable(tableIdent)
+        if (isExtended) {
+          val information = catalog.getTempViewOrPermanentTableMetadata(tableIdent).toString
+          Row(database, tableName, isTemp, s"${information}\n")
--- End diff --

Do you know why we need `s"${information}\n"` instead of `information`?





[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15996
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70654/
Test PASSed.





[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15996
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15996
  
**[Test build #70654 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70654/testReport)**
 for PR 15996 at commit 
[`7f8d8c9`](https://github.com/apache/spark/commit/7f8d8c9f8f62d46094d4c66dff5ebc4de1b4b2de).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14452
  
**[Test build #70659 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70659/testReport)**
 for PR 14452 at commit 
[`aeba1c3`](https://github.com/apache/spark/commit/aeba1c31f72508c6a93d4f056a8333f4f01e6f80).





[GitHub] spark pull request #14452: [SPARK-16849][SQL] Improve subquery execution by ...

2016-12-27 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14452#discussion_r94001153
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -1167,3 +1173,211 @@ object RemoveRepetitionFromGroupExpressions extends 
Rule[LogicalPlan] {
   a.copy(groupingExpressions = newGrouping)
   }
 }
+
+/**
+ * Optimizes the logical plans wrapped in SubqueryAlias and operators on 
them.
+ * The SubqueryAlias which are remaining in optimization phase are common 
subqueries,
+ * i.e., they are duplicate in the whole query plan. The logical plans 
wrapped in
+ * SubqueryAlias will be executed individually later. However, some 
operators such as
+ * Project and Filter can be optimized with the wrapped logical plans. 
Thus, this rule
+ * considers the optimization of the wrapped logical plans and operators 
on SubqueryAlias.
+ */
+case class OptimizeCommonSubqueries(optimizer: Optimizer)
+extends Rule[LogicalPlan] with PredicateHelper {
+  // Optimized the subqueries which all have a Project parent node and the 
same results.
+  private def optimizeProjectWithSubqueries(
+  plan: LogicalPlan,
+  keyPlan: LogicalPlan,
+  subqueries: ArrayBuffer[LogicalPlan]): LogicalPlan = {
+plan transform {
+  case p @ Project(pList, s @ SubqueryAlias(alias, subquery, v, true))
+  if s.sameResult(keyPlan) =>
+val pListForAll: Seq[NamedExpression] = subqueries.flatMap { case 
Project(pList, child) =>
+  val rewrites = buildRewrites(child, subquery)
+  pList.map(pushToOtherPlan(_, rewrites))
+}
+
+val newSubquery = Project(pListForAll, subquery)
+val optimized = optimizer.execute(newSubquery)
+// Check if any optimization is performed.
+if (optimized.sameResult(newSubquery)) {
+  // No optimization happens. Let's keep original subquery.
+  p
+} else {
+  Project(pList.map(_.toAttribute), SubqueryAlias(alias, 
newSubquery, v, true))
+}
+}
+  }
+
+  /**
+   * Maps Attributes from the source side to the corresponding Attribute 
on the target side.
+   */
+  private def buildRewrites(source: LogicalPlan, target: LogicalPlan): 
AttributeMap[Attribute] = {
+assert(source.output.size == target.output.size)
+AttributeMap(source.output.zip(target.output))
+  }
+
+  /**
+   * Rewrites an expression so that it can be pushed to another 
LogicalPlan.
+   */
+  private def pushToOtherPlan[A <: Expression](e: A, rewrites: 
AttributeMap[Attribute]) = {
+val result = e transformUp {
+  case a: Attribute => rewrites.get(a).getOrElse(a)
+}
+
+// We must promise the compiler that we did not discard the names in 
the case of project
+// expressions.  This is safe since the only transformation is from 
Attribute => Attribute.
+result.asInstanceOf[A]
+  }
+
+  private def optimizeFilterWithSubqueries(
+  plan: LogicalPlan,
+  keyPlan: LogicalPlan,
+  subqueries: ArrayBuffer[LogicalPlan]): LogicalPlan = {
+var pushdownConds = 
splitConjunctivePredicates(subqueries(0).asInstanceOf[Filter].condition)
+subqueries.tail.foreach {
+  case Filter(otherCond, child) =>
+val rewrites = buildRewrites(child, 
subqueries(0).asInstanceOf[Filter].child)
+// We can't simply push down all conditions from other Filter by 
concatenating them with
--- End diff --

This part has been extracted out as #15558 and can be removed if that PR is 
merged.





[GitHub] spark issue #16419: [MINOR][DOC]Fix doc of ForeachWriter to use writeStream

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16419
  
**[Test build #70658 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70658/testReport)**
 for PR 16419 at commit 
[`175d941`](https://github.com/apache/spark/commit/175d941370ba5585f3ba55b5f0998059e9e1b6d7).





[GitHub] spark pull request #15664: [SPARK-18123][SQL] Use db column names instead of...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15664#discussion_r94001020
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
 ---
@@ -60,23 +60,27 @@ class JdbcRelationProvider extends 
CreatableRelationProvider
 
 val conn = JdbcUtils.createConnectionFactory(jdbcOptions)()
 try {
-  val tableExists = JdbcUtils.tableExists(conn, url, table)
+  val tableSchema = JdbcUtils.getSchema(conn, url, table)
+  val tableExists = tableSchema.isDefined
+  val caseSensitive = sqlContext.conf.caseSensitiveAnalysis
   if (tableExists) {
 mode match {
   case SaveMode.Overwrite =>
-if (isTruncate && isCascadingTruncateTable(url) == 
Some(false)) {
+val savingSchema = if (isTruncate && 
isCascadingTruncateTable(url) == Some(false)) {
   // In this case, we should truncate table and then load.
   truncateTable(conn, table)
-  saveTable(df, url, table, jdbcOptions)
+  JdbcUtils.getSavingSchema(df.schema, tableSchema.get, 
caseSensitive)
 } else {
   // Otherwise, do not truncate the table, instead drop and 
recreate it
   dropTable(conn, table)
   createTable(df.schema, url, table, createTableOptions, conn)
-  saveTable(df, url, table, jdbcOptions)
+  df.schema
 }
+saveTable(df, url, table, savingSchema, jdbcOptions)
--- End diff --

How about passing the table schema and resolving the schema inside `saveTable`? It might simplify the code.
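
For illustration, a minimal sketch of that shape (the simplified signature and the local `resolveSchema` stand-in are hypothetical, not the actual `JdbcUtils` API):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Hypothetical stand-in for this PR's getSavingSchema: take the database column names
// in the DataFrame's column order (simplified, case-insensitive matching only).
def resolveSchema(rddSchema: StructType, tableSchema: StructType): StructType =
  StructType(rddSchema.map(f =>
    tableSchema.find(_.name.equalsIgnoreCase(f.name))
      .getOrElse(throw new IllegalArgumentException(s"""Column "${f.name}" not found"""))))

// saveTable receives the optional table schema and resolves the effective schema itself,
// so the Overwrite/Append call sites don't repeat the branching.
def saveTable(df: DataFrame, url: String, table: String, tableSchema: Option[StructType]): Unit = {
  val schema = tableSchema.map(resolveSchema(df.schema, _)).getOrElse(df.schema)
  // ... build the INSERT statement from `schema` and write the DataFrame partitions ...
}
```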





[GitHub] spark pull request #16419: [MINOR][DOC]Fix doc of ForeachWriter to use write...

2016-12-27 Thread carsonwang
GitHub user carsonwang opened a pull request:

https://github.com/apache/spark/pull/16419

[MINOR][DOC]Fix doc of ForeachWriter to use writeStream

## What changes were proposed in this pull request?

Fix the documentation of `ForeachWriter` to use `writeStream` instead of `write` for a streaming Dataset.
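
A minimal sketch of the corrected usage, for reference (`streamingDs` is assumed to be a streaming `Dataset[String]`):

```scala
import org.apache.spark.sql.ForeachWriter

val query = streamingDs.writeStream
  .foreach(new ForeachWriter[String] {
    def open(partitionId: Long, version: Long): Boolean = true   // accept every partition/epoch
    def process(value: String): Unit = println(value)            // per-record sink logic
    def close(errorOrNull: Throwable): Unit = ()                 // release any resources
  })
  .start()
```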

## How was this patch tested?
Docs only.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/carsonwang/spark FixDoc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16419


commit 175d941370ba5585f3ba55b5f0998059e9e1b6d7
Author: Carson Wang 
Date:   2016-12-28T06:50:01Z

Fix doc of ForeachWriter







[GitHub] spark pull request #15664: [SPARK-18123][SQL] Use db column names instead of...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15664#discussion_r94000876
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 ---
@@ -211,6 +211,52 @@ object JdbcUtils extends Logging {
   }
 
   /**
+   * Returns the schema if the table already exists in the JDBC database.
+   */
+  def getSchema(conn: Connection, url: String, table: String): 
Option[StructType] = {
+val dialect = JdbcDialects.get(url)
+
+Try {
+  val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
+  try {
+getSchema(statement.executeQuery(), dialect)
+  } finally {
+statement.close()
+  }
+} match {
+  case Success(v) =>
+Some(v)
+  case Failure(e) =>
+None
+}
+  }
+
+  /**
+   * Returns the saving schema using rddSchema's sequence and 
tableSchema's name.
--- End diff --

Here, we need to explain why we use the column sequence from `rddSchema` and why we use the column names from `tableSchema`.
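
For example (illustrative schemas only, assuming case-insensitive resolution):

```scala
import org.apache.spark.sql.types._

val rddSchema   = StructType(Seq(StructField("NAME", StringType), StructField("id", IntegerType)))
val tableSchema = StructType(Seq(StructField("id", IntegerType), StructField("Name", StringType)))

// The INSERT column list must follow the order in which the DataFrame supplies values
// (rddSchema), while each listed name must match the column as it is defined in the
// database (tableSchema), so the resolved schema here would be: Name (string), id (int).
```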





[GitHub] spark pull request #15664: [SPARK-18123][SQL] Use db column names instead of...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15664#discussion_r94000781
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 ---
@@ -211,6 +211,52 @@ object JdbcUtils extends Logging {
   }
 
   /**
+   * Returns the schema if the table already exists in the JDBC database.
+   */
+  def getSchema(conn: Connection, url: String, table: String): 
Option[StructType] = {
+val dialect = JdbcDialects.get(url)
+
+Try {
+  val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
+  try {
+getSchema(statement.executeQuery(), dialect)
+  } finally {
+statement.close()
+  }
+} match {
+  case Success(v) =>
+Some(v)
+  case Failure(e) =>
+None
+}
+  }
+
+  /**
+   * Returns the saving schema using rddSchema's sequence and 
tableSchema's name.
+   */
+  def getSavingSchema(
+  rddSchema: StructType,
+  tableSchema: StructType,
+  caseSensitive: Boolean): StructType = {
+val nameMap = tableSchema.fields.map(f => f.name -> f).toMap
+val lowercaseNameMap = tableSchema.fields.map(f => f.name.toLowerCase 
-> f).toMap
+
+var schema = new StructType()
+rddSchema.fields.foreach { f =>
+  if (nameMap.isDefinedAt(f.name)) {
+// identical names
+schema = schema.add(nameMap(f.name))
+  } else if (!caseSensitive && 
lowercaseNameMap.isDefinedAt(f.name.toLowerCase)) {
+// case-insensitive identical names
+schema = schema.add(lowercaseNameMap(f.name.toLowerCase))
+  } else {
+throw new org.apache.spark.SparkException(s"""Column "${f.name}" 
not found""")
--- End diff --

`org.apache.spark.SparkException` -> `AnalysisException`





[GitHub] spark pull request #15664: [SPARK-18123][SQL] Use db column names instead of...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15664#discussion_r94000757
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 ---
@@ -211,6 +211,52 @@ object JdbcUtils extends Logging {
   }
 
   /**
+   * Returns the schema if the table already exists in the JDBC database.
+   */
+  def getSchema(conn: Connection, url: String, table: String): 
Option[StructType] = {
+val dialect = JdbcDialects.get(url)
+
+Try {
+  val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
+  try {
+getSchema(statement.executeQuery(), dialect)
+  } finally {
+statement.close()
+  }
+} match {
+  case Success(v) =>
+Some(v)
+  case Failure(e) =>
+None
+}
+  }
+
+  /**
+   * Returns the saving schema using rddSchema's sequence and 
tableSchema's name.
+   */
+  def getSavingSchema(
+  rddSchema: StructType,
+  tableSchema: StructType,
+  caseSensitive: Boolean): StructType = {
+val nameMap = tableSchema.fields.map(f => f.name -> f).toMap
+val lowercaseNameMap = tableSchema.fields.map(f => f.name.toLowerCase 
-> f).toMap
+
+var schema = new StructType()
+rddSchema.fields.foreach { f =>
+  if (nameMap.isDefinedAt(f.name)) {
+// identical names
+schema = schema.add(nameMap(f.name))
+  } else if (!caseSensitive && 
lowercaseNameMap.isDefinedAt(f.name.toLowerCase)) {
+// case-insensitive identical names
--- End diff --

Need to improve the comments. Actually, we return the case-sensitive column names.





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13909
  
**[Test build #70657 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70657/testReport)**
 for PR 13909 at commit 
[`cfe2e3d`](https://github.com/apache/spark/commit/cfe2e3d9defd7c04e921e886dc129725ce06fc67).





[GitHub] spark pull request #15664: [SPARK-18123][SQL] Use db column names instead of...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15664#discussion_r94000523
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 ---
@@ -211,6 +211,52 @@ object JdbcUtils extends Logging {
   }
 
   /**
+   * Returns the schema if the table already exists in the JDBC database.
+   */
+  def getSchema(conn: Connection, url: String, table: String): 
Option[StructType] = {
+val dialect = JdbcDialects.get(url)
+
+Try {
+  val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
+  try {
+getSchema(statement.executeQuery(), dialect)
+  } finally {
+statement.close()
+  }
+} match {
+  case Success(v) =>
+Some(v)
+  case Failure(e) =>
+None
+}
+  }
+
+  /**
+   * Returns the saving schema using rddSchema's sequence and 
tableSchema's name.
--- End diff --

`saving schema` is not right. We need a better name here. 

`rddSchema's sequence and tableSchema's name` -> `rddSchema's column 
sequence and tableSchema's column names`





[GitHub] spark pull request #15664: [SPARK-18123][SQL] Use db column names instead of...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15664#discussion_r93999899
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcRelationProvider.scala
 ---
@@ -60,23 +60,27 @@ class JdbcRelationProvider extends 
CreatableRelationProvider
 
 val conn = JdbcUtils.createConnectionFactory(jdbcOptions)()
 try {
-  val tableExists = JdbcUtils.tableExists(conn, url, table)
+  val tableSchema = JdbcUtils.getSchema(conn, url, table)
+  val tableExists = tableSchema.isDefined
+  val caseSensitive = sqlContext.conf.caseSensitiveAnalysis
   if (tableExists) {
 mode match {
   case SaveMode.Overwrite =>
-if (isTruncate && isCascadingTruncateTable(url) == 
Some(false)) {
+val savingSchema = if (isTruncate && 
isCascadingTruncateTable(url) == Some(false)) {
   // In this case, we should truncate table and then load.
   truncateTable(conn, table)
-  saveTable(df, url, table, jdbcOptions)
+  JdbcUtils.getSavingSchema(df.schema, tableSchema.get, 
caseSensitive)
--- End diff --

Based on your current implementation, the `tableSchema` could be None, 
right?





[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16291
  
**[Test build #70656 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70656/testReport)**
 for PR 16291 at commit 
[`e09d66e`](https://github.com/apache/spark/commit/e09d66ef58049481924dcdf88157bbb981cac7fd).





[GitHub] spark issue #16364: [DOC][BUILD][MINOR] add doc on new make-distribution swi...

2016-12-27 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16364
  
thanks @srowen - merged to master.
We currently have info on R packaging at https://github.com/apache/spark/blob/master/R/CRAN_RELEASE.md - from a quick check, we don't have any existing example of docs linking to a .md file that is not under docs but in git. I guess I will need to think more about how that should be organized, and how it is referenced by the `--r` switch, etc.






[GitHub] spark pull request #15664: [SPARK-18123][SQL] Use db column names instead of...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15664#discussion_r93999377
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 ---
@@ -211,6 +211,52 @@ object JdbcUtils extends Logging {
   }
 
   /**
+   * Returns the schema if the table already exists in the JDBC database.
+   */
+  def getSchema(conn: Connection, url: String, table: String): 
Option[StructType] = {
+val dialect = JdbcDialects.get(url)
+
+Try {
+  val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
+  try {
+getSchema(statement.executeQuery(), dialect)
+  } finally {
+statement.close()
+  }
+} match {
+  case Success(v) =>
+Some(v)
+  case Failure(e) =>
+None
+}
--- End diff --

Please do not use Try/Success/Failure. 
https://github.com/databricks/scala-style-guide#exception-handling-try-vs-try
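
For instance, a possible shape without `Try` (a sketch only; it assumes the method stays in `JdbcUtils`, where `JdbcDialects` and the `ResultSet`-based `getSchema` overload are in scope):

```scala
import java.sql.Connection
import scala.util.control.NonFatal
import org.apache.spark.sql.types.StructType

def getSchema(conn: Connection, url: String, table: String): Option[StructType] = {
  val dialect = JdbcDialects.get(url)
  try {
    val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
    try {
      Some(getSchema(statement.executeQuery(), dialect))
    } finally {
      statement.close()
    }
  } catch {
    case NonFatal(_) => None   // the table (or its schema) could not be read
  }
}
```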







[GitHub] spark pull request #16364: [DOC][BUILD][MINOR] add doc on new make-distribut...

2016-12-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16364





[GitHub] spark issue #16410: [SPARK-19005][SQL] Keep column ordering when a schema is...

2016-12-27 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/16410
  
Aha, okay, I'll fix it that way. Thanks!





[GitHub] spark issue #16410: [SPARK-19005][SQL] Keep column ordering when a schema is...

2016-12-27 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16410
  
For `create table t(a int, b int) partitioned by (a)`, the schema of table t is `<b, a>`: the partition column is moved to the end.

This behavior is intentional and already published; we cannot change it. What we should do is find out the other places that don't follow this rule and respect the given schema, i.e. you are doing the opposite thing.
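
A minimal illustration of that published behavior (assumes a `SparkSession` named `spark`):

```scala
spark.sql("CREATE TABLE t(a INT, b INT) USING parquet PARTITIONED BY (a)")
// Partition columns are moved to the end of the schema, so this prints `b` before `a`:
spark.table("t").printSchema()
```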





[GitHub] spark pull request #16404: [SPARK-18969][SQL] Support grouping by nondetermi...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16404#discussion_r93998144
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1918,28 +1918,37 @@ class Analyzer(
   case p: Project => p
   case f: Filter => f
 
+  case a: Aggregate if a.groupingExpressions.exists(!_.deterministic) 
=>
+val nondeterToAttr = getNondeterToAttr(a.groupingExpressions)
+val newChild = Project(a.child.output ++ nondeterToAttr.values, 
a.child)
+a.transformExpressions { case e =>
+  nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
+}.copy(child = newChild)
+
   // todo: It's hard to write a general rule to pull out 
nondeterministic expressions
   // from LogicalPlan, currently we only do it for UnaryNode which has 
same output
   // schema with its child.
   case p: UnaryNode if p.output == p.child.output && 
p.expressions.exists(!_.deterministic) =>
-val nondeterministicExprs = 
p.expressions.filterNot(_.deterministic).flatMap { expr =>
-  val leafNondeterministic = expr.collect {
-case n: Nondeterministic => n
-  }
-  leafNondeterministic.map { e =>
-val ne = e match {
-  case n: NamedExpression => n
-  case _ => Alias(e, "_nondeterministic")(isGenerated = true)
-}
-new TreeNodeRef(e) -> ne
-  }
-}.toMap
+val nondeterToAttr = getNondeterToAttr(p.expressions)
 val newPlan = p.transformExpressions { case e =>
-  nondeterministicExprs.get(new 
TreeNodeRef(e)).map(_.toAttribute).getOrElse(e)
+  nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
 }
-val newChild = Project(p.child.output ++ 
nondeterministicExprs.values, p.child)
+val newChild = Project(p.child.output ++ nondeterToAttr.values, 
p.child)
 Project(p.output, newPlan.withNewChildren(newChild :: Nil))
 }
+
+private def getNondeterToAttr(exprs: Seq[Expression]): Map[Expression, 
NamedExpression] = {
+  exprs.filterNot(_.deterministic).flatMap { expr =>
+val leafNondeterministic = expr.collect { case n: Nondeterministic 
=> n }
--- End diff --

`statefulUDF()` is a stateful/non-deterministic UDF which does not extend `Nondeterministic`, but its `deterministic` is equal to `false`.





[GitHub] spark pull request #16404: [SPARK-18969][SQL] Support grouping by nondetermi...

2016-12-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16404#discussion_r93997721
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1918,28 +1918,37 @@ class Analyzer(
   case p: Project => p
   case f: Filter => f
 
+  case a: Aggregate if a.groupingExpressions.exists(!_.deterministic) 
=>
+val nondeterToAttr = getNondeterToAttr(a.groupingExpressions)
+val newChild = Project(a.child.output ++ nondeterToAttr.values, 
a.child)
+a.transformExpressions { case e =>
+  nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
+}.copy(child = newChild)
+
   // todo: It's hard to write a general rule to pull out 
nondeterministic expressions
   // from LogicalPlan, currently we only do it for UnaryNode which has 
same output
   // schema with its child.
   case p: UnaryNode if p.output == p.child.output && 
p.expressions.exists(!_.deterministic) =>
-val nondeterministicExprs = 
p.expressions.filterNot(_.deterministic).flatMap { expr =>
-  val leafNondeterministic = expr.collect {
-case n: Nondeterministic => n
-  }
-  leafNondeterministic.map { e =>
-val ne = e match {
-  case n: NamedExpression => n
-  case _ => Alias(e, "_nondeterministic")(isGenerated = true)
-}
-new TreeNodeRef(e) -> ne
-  }
-}.toMap
+val nondeterToAttr = getNondeterToAttr(p.expressions)
 val newPlan = p.transformExpressions { case e =>
-  nondeterministicExprs.get(new 
TreeNodeRef(e)).map(_.toAttribute).getOrElse(e)
+  nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
 }
-val newChild = Project(p.child.output ++ 
nondeterministicExprs.values, p.child)
+val newChild = Project(p.child.output ++ nondeterToAttr.values, 
p.child)
 Project(p.output, newPlan.withNewChildren(newChild :: Nil))
 }
+
+private def getNondeterToAttr(exprs: Seq[Expression]): Map[Expression, 
NamedExpression] = {
+  exprs.filterNot(_.deterministic).flatMap { expr =>
+val leafNondeterministic = expr.collect { case n: Nondeterministic 
=> n }
--- End diff --


https://github.com/apache/spark/pull/16404/files#diff-a0f2e45a5da747e9ec483f3557aa1b8bR138





[GitHub] spark pull request #16404: [SPARK-18969][SQL] Support grouping by nondetermi...

2016-12-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16404#discussion_r93997700
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1918,28 +1918,37 @@ class Analyzer(
   case p: Project => p
   case f: Filter => f
 
+  case a: Aggregate if a.groupingExpressions.exists(!_.deterministic) 
=>
+val nondeterToAttr = getNondeterToAttr(a.groupingExpressions)
+val newChild = Project(a.child.output ++ nondeterToAttr.values, 
a.child)
+a.transformExpressions { case e =>
+  nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
+}.copy(child = newChild)
+
   // todo: It's hard to write a general rule to pull out 
nondeterministic expressions
   // from LogicalPlan, currently we only do it for UnaryNode which has 
same output
   // schema with its child.
   case p: UnaryNode if p.output == p.child.output && 
p.expressions.exists(!_.deterministic) =>
-val nondeterministicExprs = 
p.expressions.filterNot(_.deterministic).flatMap { expr =>
-  val leafNondeterministic = expr.collect {
-case n: Nondeterministic => n
-  }
-  leafNondeterministic.map { e =>
-val ne = e match {
-  case n: NamedExpression => n
-  case _ => Alias(e, "_nondeterministic")(isGenerated = true)
-}
-new TreeNodeRef(e) -> ne
-  }
-}.toMap
+val nondeterToAttr = getNondeterToAttr(p.expressions)
 val newPlan = p.transformExpressions { case e =>
-  nondeterministicExprs.get(new 
TreeNodeRef(e)).map(_.toAttribute).getOrElse(e)
+  nondeterToAttr.get(e).map(_.toAttribute).getOrElse(e)
 }
-val newChild = Project(p.child.output ++ 
nondeterministicExprs.values, p.child)
+val newChild = Project(p.child.output ++ nondeterToAttr.values, 
p.child)
 Project(p.output, newPlan.withNewChildren(newChild :: Nil))
 }
+
+private def getNondeterToAttr(exprs: Seq[Expression]): Map[Expression, 
NamedExpression] = {
+  exprs.filterNot(_.deterministic).flatMap { expr =>
+val leafNondeterministic = expr.collect { case n: Nondeterministic 
=> n }
--- End diff --

Is it the same as the existing test? `select a, rand(0), sum(b) from data group by a, 2`





[GitHub] spark issue #13909: [SPARK-16213][SQL] Reduce runtime overhead of a program ...

2016-12-27 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13909
  
yea we should! thanks for pointing this out!





[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...

2016-12-27 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r93996591
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -510,32 +510,121 @@ class Analyzer(
* Replaces [[UnresolvedRelation]]s with concrete relations from the 
catalog.
*/
   object ResolveRelations extends Rule[LogicalPlan] {
-private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan 
= {
+
+// If the unresolved relation is running directly on files, we just 
return the original
+// UnresolvedRelation, the plan will get resolved later. Else we look 
up the table from catalog
+// and change the default database name if it is a view.
+// We usually look up a table from the default database if the table 
identifier has an empty
+// database part, for a view the default database should be the 
currentDb when the view was
+// created. When the case comes to resolving a nested view, the view 
may have different default
+// database with that the referenced view has, so we need to use the 
variable `defaultDatabase`
+// to track the current default database.
+// When the relation we resolve is a view, we fetch the 
view.desc(which is a CatalogTable), and
+// then set the value of `CatalogTable.viewDefaultDatabase` to the 
variable `defaultDatabase`,
+// we look up the relations that the view references using the default 
database.
+// For example:
+// |- view1 (defaultDatabase = db1)
+//   |- operator
+// |- table2 (defaultDatabase = db1)
+// |- view2 (defaultDatabase = db2)
+//|- view3 (defaultDatabase = db3)
+//   |- view4 (defaultDatabase = db4)
+// In this case, the view `view1` is a nested view, it directly 
references `table2`、`view2`
+// and `view4`, the view `view2` references `view3`. On resolving the 
table, we look up the
+// relations `table2`、`view2`、`view4` using the default database 
`db1`, and look up the
+// relation `view3` using the default database `db2`.
+//
+// Note this is compatible with the views defined by older versions of 
Spark(before 2.2), which
+// have empty defaultDatabase and all the relations in viewText have 
database part defined.
+def resolveRelation(
+plan: LogicalPlan,
+defaultDatabase: Option[String] = None): LogicalPlan = plan match {
+  case u @ UnresolvedRelation(table: TableIdentifier, _) if 
isRunningDirectlyOnFiles(table) =>
+u
+  case u: UnresolvedRelation =>
+resolveView(lookupTableFromCatalog(u, defaultDatabase))
+}
--- End diff --

I will try to combine both functions.





[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...

2016-12-27 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r93996552
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala 
---
@@ -543,4 +545,34 @@ class SQLViewSuite extends QueryTest with SQLTestUtils 
with TestHiveSingleton {
   }
 }
   }
+
+  test("correctly resolve a nested view") {
+withTempDatabase { db =>
+  withView(s"$db.view1", s"$db.view2") {
+val view1 = CatalogTable(
+  identifier = TableIdentifier("view1", Some(db)),
+  tableType = CatalogTableType.VIEW,
+  storage = CatalogStorageFormat.empty,
+  schema = new StructType().add("id", "int").add("id1", "int"),
+  provider = Some("parquet"),
--- End diff --

I had some misunderstanding about a view that references a datasource table; will change that ASAP! Thank you, and sorry for the trouble!





[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15996
  
**[Test build #70654 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70654/testReport)**
 for PR 15996 at commit 
[`7f8d8c9`](https://github.com/apache/spark/commit/7f8d8c9f8f62d46094d4c66dff5ebc4de1b4b2de).





[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r93996247
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -510,32 +510,121 @@ class Analyzer(
* Replaces [[UnresolvedRelation]]s with concrete relations from the 
catalog.
*/
   object ResolveRelations extends Rule[LogicalPlan] {
-private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan 
= {
+
+// If the unresolved relation is running directly on files, we just 
return the original
+// UnresolvedRelation, the plan will get resolved later. Else we look 
up the table from catalog
+// and change the default database name if it is a view.
+// We usually look up a table from the default database if the table 
identifier has an empty
+// database part, for a view the default database should be the 
currentDb when the view was
+// created. When the case comes to resolving a nested view, the view 
may have different default
+// database with that the referenced view has, so we need to use the 
variable `defaultDatabase`
+// to track the current default database.
+// When the relation we resolve is a view, we fetch the 
view.desc(which is a CatalogTable), and
+// then set the value of `CatalogTable.viewDefaultDatabase` to the 
variable `defaultDatabase`,
+// we look up the relations that the view references using the default 
database.
+// For example:
+// |- view1 (defaultDatabase = db1)
+//   |- operator
+// |- table2 (defaultDatabase = db1)
+// |- view2 (defaultDatabase = db2)
+//|- view3 (defaultDatabase = db3)
+//   |- view4 (defaultDatabase = db4)
+// In this case, the view `view1` is a nested view, it directly 
references `table2`、`view2`
+// and `view4`, the view `view2` references `view3`. On resolving the 
table, we look up the
+// relations `table2`、`view2`、`view4` using the default database 
`db1`, and look up the
+// relation `view3` using the default database `db2`.
+//
+// Note this is compatible with the views defined by older versions of 
Spark(before 2.2), which
+// have empty defaultDatabase and all the relations in viewText have 
database part defined.
+def resolveRelation(
+plan: LogicalPlan,
+defaultDatabase: Option[String] = None): LogicalPlan = plan match {
+  case u @ UnresolvedRelation(table: TableIdentifier, _) if 
isRunningDirectlyOnFiles(table) =>
+u
+  case u: UnresolvedRelation =>
+resolveView(lookupTableFromCatalog(u, defaultDatabase))
+}
--- End diff --

The name `resolveView` is confusing, since we are also resolving regular tables.

What is the reason we are unable to use `resolveRelation` to handle all the cases? (That is, can we combine `resolveView` and `resolveRelation`?)





[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r93995925
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -510,32 +510,121 @@ class Analyzer(
* Replaces [[UnresolvedRelation]]s with concrete relations from the 
catalog.
*/
   object ResolveRelations extends Rule[LogicalPlan] {
-private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan 
= {
+
+// If the unresolved relation is running directly on files, we just 
return the original
+// UnresolvedRelation, the plan will get resolved later. Else we look 
up the table from catalog
+// and change the default database name if it is a view.
+// We usually look up a table from the default database if the table 
identifier has an empty
+// database part, for a view the default database should be the 
currentDb when the view was
+// created. When the case comes to resolving a nested view, the view 
may have different default
+// database with that the referenced view has, so we need to use the 
variable `defaultDatabase`
+// to track the current default database.
+// When the relation we resolve is a view, we fetch the 
view.desc(which is a CatalogTable), and
+// then set the value of `CatalogTable.viewDefaultDatabase` to the 
variable `defaultDatabase`,
+// we look up the relations that the view references using the default 
database.
+// For example:
+// |- view1 (defaultDatabase = db1)
+//   |- operator
+// |- table2 (defaultDatabase = db1)
+// |- view2 (defaultDatabase = db2)
+//|- view3 (defaultDatabase = db3)
+//   |- view4 (defaultDatabase = db4)
+// In this case, the view `view1` is a nested view, it directly 
references `table2`、`view2`
+// and `view4`, the view `view2` references `view3`. On resolving the 
table, we look up the
+// relations `table2`、`view2`、`view4` using the default database 
`db1`, and look up the
+// relation `view3` using the default database `db2`.
+//
+// Note this is compatible with the views defined by older versions of 
Spark(before 2.2), which
+// have empty defaultDatabase and all the relations in viewText have 
database part defined.
+def resolveRelation(
+plan: LogicalPlan,
+defaultDatabase: Option[String] = None): LogicalPlan = plan match {
+  case u @ UnresolvedRelation(table: TableIdentifier, _) if 
isRunningDirectlyOnFiles(table) =>
+u
+  case u: UnresolvedRelation =>
+resolveView(lookupTableFromCatalog(u, defaultDatabase))
+}
+
+def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+  case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) 
if child.resolved =>
+i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u)))
+  case u: UnresolvedRelation => resolveRelation(u)
+}
+
+// Look up the table with the given name from catalog. The database we 
look up the table from
+// is decided follow the steps:
+// 1. If the database part is defined in the table identifier, use 
that database name;
+// 2. Else If the defaultDatabase is defined, use the default database 
name;
+// 3. Else use the currentDb of the SessionCatalog.
+private def lookupTableFromCatalog(
+u: UnresolvedRelation,
+defaultDatabase: Option[String] = None): LogicalPlan = {
   try {
-catalog.lookupRelation(u.tableIdentifier, u.alias)
+catalog.lookupRelation(u.tableIdentifier, u.alias, defaultDatabase)
   } catch {
 case _: NoSuchTableException =>
   u.failAnalysis(s"Table or view not found: ${u.tableName}")
   }
 }
 
-def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
-  case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) 
if child.resolved =>
-i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u)))
-  case u: UnresolvedRelation =>
-val table = u.tableIdentifier
-if (table.database.isDefined && conf.runSQLonFile && 
!catalog.isTemporaryTable(table) &&
-(!catalog.databaseExists(table.database.get) || 
!catalog.tableExists(table))) {
-  // If the database part is specified, and we support running SQL 
directly on files, and
-  // it's not a temporary view, and the table does not exist, then 
let's just return the
-  // original UnresolvedRelation. It is possible we are matching a 
query like "select *
-  // from parquet.`/path/to/query`". The plan will get resolved 
later.
-

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93995903
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => 
BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[classification] trait LinearSVCParams extends ClassifierParams 
with HasRegParam
+  with HasMaxIter with HasFitIntercept with HasTol with HasStandardization 
with HasWeightCol
+  with HasThreshold with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+  setDefault(fitIntercept -> true)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy at the cost of more 
iterations.
+   * Default is 1E-6.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+  setDefault(tol -> 1E-6)
+
+  /**
+   * whether to standardize the training features before fitting the model.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setStandardization(value: Boolean): this.type = set(standardization, 
value)
+  setDefault(standardization -> true)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  /**
+   * Set threshold in binary classification, in range [0, 1].
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setThreshold(value: Double): this.type = set(threshold, value)
+  setDefault(threshold -> 

[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r93995882
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLViewSuite.scala 
---
@@ -543,4 +545,34 @@ class SQLViewSuite extends QueryTest with SQLTestUtils 
with TestHiveSingleton {
   }
 }
   }
+
+  test("correctly resolve a nested view") {
+withTempDatabase { db =>
+  withView(s"$db.view1", s"$db.view2") {
+val view1 = CatalogTable(
+  identifier = TableIdentifier("view1", Some(db)),
+  tableType = CatalogTableType.VIEW,
+  storage = CatalogStorageFormat.empty,
+  schema = new StructType().add("id", "int").add("id1", "int"),
+  provider = Some("parquet"),
--- End diff --

When a relation is a `View`, the provider is empty. I do not understand why 
we need to provide a provider in this case. What does `parquet` mean for a 
view?
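
(For illustration, a hedged sketch of the same view description with the 
provider dropped, mirroring the test above; the `viewText` value is 
hypothetical, not taken from the PR.)

    // Hedged sketch: a view's CatalogTable normally carries no data source provider.
    val viewDesc = CatalogTable(
      identifier = TableIdentifier("view1", Some(db)),
      tableType = CatalogTableType.VIEW,
      storage = CatalogStorageFormat.empty,
      schema = new StructType().add("id", "int").add("id1", "int"),
      provider = None,                           // nothing like `parquet` is needed for a view
      viewText = Some("SELECT * FROM jt"))       // hypothetical view definition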


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r93995592
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -377,6 +378,36 @@ case class InsertIntoTable(
   override lazy val resolved: Boolean = childrenResolved && table.resolved
 }
 
+/** Factory for constructing new `View` nodes. */
+object View {
+  def apply(desc: CatalogTable): View = View(desc, 
desc.schema.toAttributes, None)
+}
+
+/**
+ * A container for holding the view description(CatalogTable), and the 
output of the view. The
+ * child will be defined if the view is defined in a Hive metastore or the 
view is resolved,
+ * else it should be None.
+ * This operator will be removed at the end of analysis stage.
+ *
+ * @param desc A view description(CatalogTable) that provides necessary 
information to resolve the
+ * view.
+ * @param output The output of a view operator, this is generated during 
planning the view, so that
+ *   we are able to decouple the output from the underlying 
structure.
+ * @param child The logical plan of a view operator, it should be 
non-empty if the view is defined
+ *  in a Hive metastore or the view is resolved, else it 
should be None.
+ */
+case class View(
+desc: CatalogTable,
+output: Seq[Attribute],
+child: Option[LogicalPlan] = None) extends LogicalPlan with 
MultiInstanceRelation {
+
+  override lazy val resolved: Boolean = child.exists(_.resolved)
+
+  override def children: Seq[LogicalPlan] = child.toSeq
+
+  override def newInstance(): LogicalPlan = copy(output = 
output.map(_.newInstance()))
+}
--- End diff --

Sure. name + output attributes.
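
(For illustration, a hedged sketch of what such a one-line description could 
look like; the exact format is an assumption, not the final patch.)

    // Hedged sketch: a concise one-line description for the View node, showing only
    // the view name and its output attribute names.
    override def simpleString: String = {
      val outputString = output.map(_.name).mkString("[", ", ", "]")
      s"View (${desc.identifier}, $outputString)"
    }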


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-12-27 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14452
  
Q47 is a typical query that can benefit from this PR:

with v1 as(
select i_category, i_brand,
   s_store_name, s_company_name,
   d_year, d_moy,
   sum(ss_sales_price) sum_sales,
   avg(sum(ss_sales_price)) over
 (partition by i_category, i_brand,
s_store_name, s_company_name, d_year)
 avg_monthly_sales,
   rank() over
 (partition by i_category, i_brand,
s_store_name, s_company_name
  order by d_year, d_moy) rn
from item, store_sales, date_dim, store
where ss_item_sk = i_item_sk and
  ss_sold_date_sk = d_date_sk and
  ss_store_sk = s_store_sk and
  (
d_year = 1999 or
( d_year = 1999-1 and d_moy =12) or
( d_year = 1999+1 and d_moy =1)
  )
group by i_category, i_brand,
 s_store_name, s_company_name,
 d_year, d_moy),
v2 as(
select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name, 
v1.d_year,
   v1.d_moy, v1.avg_monthly_sales ,v1.sum_sales, v1_lag.sum_sales 
psum,
   v1_lead.sum_sales nsum
from v1, v1 v1_lag, v1 v1_lead
where v1.i_category = v1_lag.i_category and
  v1.i_category = v1_lead.i_category and
  v1.i_brand = v1_lag.i_brand and
  v1.i_brand = v1_lead.i_brand and
  v1.s_store_name = v1_lag.s_store_name and
  v1.s_store_name = v1_lead.s_store_name and
  v1.s_company_name = v1_lag.s_company_name and
  v1.s_company_name = v1_lead.s_company_name and
  v1.rn = v1_lag.rn + 1 and
  v1.rn = v1_lead.rn - 1)
select * from v2
where  d_year = 1999 and
   avg_monthly_sales > 0 and
   case when avg_monthly_sales > 0 then abs(sum_sales - 
avg_monthly_sales) / avg_monthly_sales else null end > 0.1
order by sum_sales - avg_monthly_sales, 3
limit 100

It has two CTEs, `v1` and `v2`. `v2` joins three duplicated `v1` plans. No 
disjunctive predicates need to be pushed down into those `v1` plans, so the 
three duplicated `v1` plans produce the same output data in the end. The 
physical plan of `v1` alone is already fairly complex.

Without this PR we would execute the physical plan of `v1` three times. With 
this PR, the physical plan of `v1` is executed only once and its output data 
is cached.
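
(For illustration only, a hedged sketch of the same effect obtained manually 
with the public API, assuming a local SparkSession; it is not this PR's 
mechanism, which deduplicates the plans automatically.)

    // Hedged sketch: materialize the shared sub-plan once and let every reference
    // read the cached result instead of recomputing the aggregation.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cte-dedup-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq((1, 10.0), (1, 20.0), (2, 5.0)).toDF("item", "price")
    val v1 = sales.groupBy("item").sum("price").persist()    // the shared "v1" plan, computed once
    val joined = v1.as("a")
      .join(v1.as("b"), $"a.item" === $"b.item")
      .join(v1.as("c"), $"a.item" === $"c.item")              // both joins reuse the cached data
    joined.show()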







---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13326: [SPARK-15560] [Mesos] Queued/Supervise drivers waiting f...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13326
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70653/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13326: [SPARK-15560] [Mesos] Queued/Supervise drivers waiting f...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13326
  
**[Test build #70653 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70653/testReport)**
 for PR 13326 at commit 
[`38f24ea`](https://github.com/apache/spark/commit/38f24eaa327b7cdf8a76697039e19caedc8e2eec).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13326: [SPARK-15560] [Mesos] Queued/Supervise drivers waiting f...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13326
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13326: [SPARK-15560] [Mesos] Queued/Supervise drivers waiting f...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13326
  
**[Test build #70653 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70653/testReport)**
 for PR 13326 at commit 
[`38f24ea`](https://github.com/apache/spark/commit/38f24eaa327b7cdf8a76697039e19caedc8e2eec).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...

2016-12-27 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r93994361
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -377,6 +378,36 @@ case class InsertIntoTable(
   override lazy val resolved: Boolean = childrenResolved && table.resolved
 }
 
+/** Factory for constructing new `View` nodes. */
+object View {
+  def apply(desc: CatalogTable): View = View(desc, 
desc.schema.toAttributes, None)
+}
+
+/**
+ * A container for holding the view description(CatalogTable), and the 
output of the view. The
+ * child will be defined if the view is defined in a Hive metastore or the 
view is resolved,
+ * else it should be None.
+ * This operator will be removed at the end of analysis stage.
+ *
+ * @param desc A view description(CatalogTable) that provides necessary 
information to resolve the
+ * view.
+ * @param output The output of a view operator, this is generated during 
planning the view, so that
+ *   we are able to decouple the output from the underlying 
structure.
+ * @param child The logical plan of a view operator, it should be 
non-empty if the view is defined
+ *  in a Hive metastore or the view is resolved, else it 
should be None.
+ */
+case class View(
+desc: CatalogTable,
+output: Seq[Attribute],
+child: Option[LogicalPlan] = None) extends LogicalPlan with 
MultiInstanceRelation {
+
+  override lazy val resolved: Boolean = child.exists(_.resolved)
+
+  override def children: Seq[LogicalPlan] = child.toSeq
+
+  override def newInstance(): LogicalPlan = copy(output = 
output.map(_.newInstance()))
+}
--- End diff --

How about we output the `CatalogTable.qualifiedName`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16291
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70652/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16291
  
**[Test build #70652 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70652/testReport)**
 for PR 16291 at commit 
[`6494cfe`](https://github.com/apache/spark/commit/6494cfe171dce7845d0db8d53f845a1df933d202).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16291
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16233#discussion_r93993539
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -377,6 +378,36 @@ case class InsertIntoTable(
   override lazy val resolved: Boolean = childrenResolved && table.resolved
 }
 
+/** Factory for constructing new `View` nodes. */
+object View {
+  def apply(desc: CatalogTable): View = View(desc, 
desc.schema.toAttributes, None)
+}
+
+/**
+ * A container for holding the view description(CatalogTable), and the 
output of the view. The
+ * child will be defined if the view is defined in a Hive metastore or the 
view is resolved,
+ * else it should be None.
+ * This operator will be removed at the end of analysis stage.
+ *
+ * @param desc A view description(CatalogTable) that provides necessary 
information to resolve the
+ * view.
+ * @param output The output of a view operator, this is generated during 
planning the view, so that
+ *   we are able to decouple the output from the underlying 
structure.
+ * @param child The logical plan of a view operator, it should be 
non-empty if the view is defined
+ *  in a Hive metastore or the view is resolved, else it 
should be None.
+ */
+case class View(
+desc: CatalogTable,
+output: Seq[Attribute],
+child: Option[LogicalPlan] = None) extends LogicalPlan with 
MultiInstanceRelation {
+
+  override lazy val resolved: Boolean = child.exists(_.resolved)
+
+  override def children: Seq[LogicalPlan] = child.toSeq
+
+  override def newInstance(): LogicalPlan = copy(output = 
output.map(_.newInstance()))
+}
--- End diff --

We also need to add `override def simpleString`, which is used for `ONE 
line description of this node`. Currently, we output the whole `CatalogTable`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16291
  
**[Test build #70652 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70652/testReport)**
 for PR 16291 at commit 
[`6494cfe`](https://github.com/apache/spark/commit/6494cfe171dce7845d0db8d53f845a1df933d202).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16393: [SPARK-18993] [Build] Revert Split test-tags into...

2016-12-27 Thread gatorsmile
Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/16393


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16418: [SPARK-18993][BUILD] Unable to build/compile Spark in In...

2016-12-27 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16418
  
This is the workaround I did in my local environment. It works well. LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #9759: [SPARK-11753][SQL][test-hadoop2.2] Make allowNonNumericNu...

2016-12-27 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/9759
  
@srowen Unfortunately, this fix is only included since 2.8.1. Even the 
latest maintenance release, 2.7.8, doesn't contain it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16415: [SPARK-19007]Speedup and optimize the GradientBoo...

2016-12-27 Thread zdh2292390
Github user zdh2292390 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16415#discussion_r93988463
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala 
---
@@ -329,8 +330,16 @@ private[spark] object GradientBoostedTrees extends 
Logging {
   //   However, the behavior should be reasonable, though not 
optimal.
   baseLearnerWeights(m) = learningRate
 
+  if (pre_predError.getStorageLevel != StorageLevel.NONE ){
--- End diff --

Yes, `pre_predError` always refers to the previous `predError` and is no 
longer used at this point.
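
(For illustration, a hedged sketch of the persist/unpersist rotation being 
discussed, not the exact patch.)

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Hedged sketch: persist the new per-iteration error RDD, then release the previous
    // one once nothing downstream still needs it, so only the latest iteration stays cached.
    def rotateCached[T](prev: RDD[T], next: RDD[T]): RDD[T] = {
      next.persist(StorageLevel.MEMORY_AND_DISK)
      next.count()                                       // materialize before dropping the old copy
      if (prev.getStorageLevel != StorageLevel.NONE) {
        prev.unpersist(blocking = false)                 // safe only because prev is no longer referenced
      }
      next
    }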


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16392: [SPARK-18992] [SQL] Move spark.sql.hive.thriftSer...

2016-12-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16392


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16392: [SPARK-18992] [SQL] Move spark.sql.hive.thriftServer.sin...

2016-12-27 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16392
  
LGTM, merging to master!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16291
  
**[Test build #70651 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70651/testReport)**
 for PR 16291 at commit 
[`9e5819d`](https://github.com/apache/spark/commit/9e5819df62a77a12bd0b08d59499cb30aa591862).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16291
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70651/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16291
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16415: [SPARK-19007]Speedup and optimize the GradientBoo...

2016-12-27 Thread zdh2292390
Github user zdh2292390 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16415#discussion_r93987658
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala 
---
@@ -309,6 +309,7 @@ private[spark] object GradientBoostedTrees extends 
Logging {
 
 var m = 1
 var doneLearning = false
+var pre_predError=predError
--- End diff --

Thank you for reminding me  of these  basic rules. I will read the 
contributing guide and  modify the code style.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16291
  
**[Test build #70651 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70651/testReport)**
 for PR 16291 at commit 
[`9e5819d`](https://github.com/apache/spark/commit/9e5819df62a77a12bd0b08d59499cb30aa591862).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16403: [SPARK-18819][CORE] Double byte alignment on ARM platfor...

2016-12-27 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/16403
  
While we ideally expect a `static final` field to be evaluated as a constant 
by the JIT compiler, I do not know that for a fact. I would appreciate it if 
someone could look at the binary code generated by OpenJDK, since I am unable 
to do it myself for several reasons.
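
(For illustration, a hedged and entirely hypothetical probe one could compile 
and run with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly` (which needs 
the hsdis plugin) to inspect whether the JIT folds a constant-valued flag 
away; it is not part of the patch.)

    object ConstantFoldProbe {
      final val unaligned = false                         // stands in for the static final field
      def get(bytes: Array[Byte], i: Int): Double =
        if (unaligned) java.lang.Double.longBitsToDouble(0L) else bytes.length.toDouble + i

      def main(args: Array[String]): Unit = {
        val buf = new Array[Byte](16)
        var s = 0.0
        var i = 0
        while (i < 10000000) { s += get(buf, i); i += 1 } // warm up so the JIT compiles `get`
        println(s)                                        // the branch on `unaligned` should be folded away
      }
    }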


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16291
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16291
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70650/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16291
  
**[Test build #70650 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70650/testReport)**
 for PR 16291 at commit 
[`f2c777d`](https://github.com/apache/spark/commit/f2c777d4b8f2eeca4cfe2c51601e5c96e4424880).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16291
  
**[Test build #70650 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70650/testReport)**
 for PR 16291 at commit 
[`f2c777d`](https://github.com/apache/spark/commit/f2c777d4b8f2eeca4cfe2c51601e5c96e4424880).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread sitalkedia
Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/16291
  
jenkins test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13143: [SPARK-15359] [Mesos] Mesos dispatcher should handle DRI...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13143
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70648/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13143: [SPARK-15359] [Mesos] Mesos dispatcher should handle DRI...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13143
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13143: [SPARK-15359] [Mesos] Mesos dispatcher should handle DRI...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13143
  
**[Test build #70648 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70648/testReport)**
 for PR 13143 at commit 
[`946ad7d`](https://github.com/apache/spark/commit/946ad7dae5ab92b3bf5f8979ed43b7c7de461ec2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16291
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16291
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70649/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16291
  
**[Test build #70649 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70649/testReport)**
 for PR 16291 at commit 
[`162001a`](https://github.com/apache/spark/commit/162001a2cd6fd75cdff22b4e5620f007e483480a).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16291: [SPARK-18838][CORE] Use separate executor service for ea...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16291
  
**[Test build #70649 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70649/testReport)**
 for PR 16291 at commit 
[`162001a`](https://github.com/apache/spark/commit/162001a2cd6fd75cdff22b4e5620f007e483480a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13143: [SPARK-15359] [Mesos] Mesos dispatcher should handle DRI...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13143
  
**[Test build #70648 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70648/testReport)**
 for PR 13143 at commit 
[`946ad7d`](https://github.com/apache/spark/commit/946ad7dae5ab92b3bf5f8979ed43b7c7de461ec2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16370: [SPARK-18960][SQL][SS] Avoid double reading file which i...

2016-12-27 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16370
  
@srowen It looks like @zsxwing has not had time to give feedback yet.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16255: [SPARK-18609][SQL]Fix when CTE with Join between ...

2016-12-27 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/16255#discussion_r93892249
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -200,6 +200,8 @@ object RemoveAliasOnlyProject extends Rule[LogicalPlan] 
{
 case plan: Project if plan eq proj => plan.child
--- End diff --

Sorry, to describe it more clearly: is this not safe?
https://github.com/windpiger/spark/blob/0413f9dad4ad1294e3400dc0f42f66529b1b055b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L203-L204



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16411: [SPARK-17984][YARN][Mesos][Deploy][WIP] add executor lau...

2016-12-27 Thread xiaochang-wu
Github user xiaochang-wu commented on the issue:

https://github.com/apache/spark/pull/16411
  
@srowen Do you mean #15579?
1) Fix some character escape issues of #15579 when the command string 
contains special characters such as '
2) Add Standalone and Mesos support, as #15579 only supports YARN mode


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13323: [SPARK-15555] [Mesos] Driver with --supervise option can...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13323
  
**[Test build #70647 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70647/testReport)**
 for PR 13323 at commit 
[`e632848`](https://github.com/apache/spark/commit/e632848697477f4ae102ccb5ee60431fe3c61440).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13323: [SPARK-15555] [Mesos] Driver with --supervise option can...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13323
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13323: [SPARK-15555] [Mesos] Driver with --supervise option can...

2016-12-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13323
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/70647/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16403: [SPARK-18819][CORE] Double byte alignment on ARM ...

2016-12-27 Thread michaelkamprath
Github user michaelkamprath commented on a diff in the pull request:

https://github.com/apache/spark/pull/16403#discussion_r93983319
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java 
---
@@ -244,6 +257,34 @@ public static void throwException(Throwable t) {
   LONG_ARRAY_OFFSET = _UNSAFE.arrayBaseOffset(long[].class);
   FLOAT_ARRAY_OFFSET = _UNSAFE.arrayBaseOffset(float[].class);
   DOUBLE_ARRAY_OFFSET = _UNSAFE.arrayBaseOffset(double[].class);
+
+  // determine whether double access should be aligned.
+  String arch = System.getProperty("os.arch", "");
+  if (arch.matches("^(arm|arm32)")) {
+logger.info(
+"Host platform '{}' requires aligned double access. "+
+"Creating an aligned buffer for unsafe double reads.",
+arch);
+
+// allocate a 2x memory block to ensure buffer used is 8-byte 
aligned. Java
+// objects are always aligned, so we just need to ensure the 
offset is aligned
+// to an 8-byte boundary
+byte[] heapObj = new byte[16];
+long offset = BYTE_ARRAY_OFFSET;
+long bufferSize = 16;
+for (long i = 0; i < 8; ++i ) {
+  if ((offset+i)%8 == 0) {
+logger.debug("Found aligned buffer offset at {} + {}", offset, 
i);
+offset += i;
+bufferSize -= i;
+break;
+  }
--- End diff --

No, we can't, because we would still need the buffer to avoid doing a direct 
double read from an unaligned memory location on [line 
131](https://github.com/michaelkamprath/spark/blob/30c6c997ca8e864a31d2f4dfa55d47b5aa629596/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java#L131).
 

The reason for finding an aligned offset here is that on ARM7 the actual 
memory address of the `byte[0]` element is not the memory address of the 
`byte[]` object itself. The starting point of the 0-index element is given by 
`Platform.BYTE_ARRAY_OFFSET`, which accounts for the Java overhead used to 
manage the `byte[]` object. So a read from `byte[0]` could be unaligned 
because of that overhead, hence the need to find the index in the byte buffer 
at which overhead + index is 8-byte aligned. I am relying on Java aligning the 
object itself (here, the `byte[]` object), so I only need to check that the 
offset is also aligned.
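
(For illustration, a hedged sketch showing that the loop above is equivalent 
to a direct padding computation; this is a restatement of the intent, not the 
patch itself.)

    // Hedged sketch: number of bytes to skip so that (BYTE_ARRAY_OFFSET + padding)
    // lands on an 8-byte boundary.
    def alignedPadding(byteArrayOffset: Long, alignment: Long = 8L): Long =
      (alignment - byteArrayOffset % alignment) % alignment

    // Example: with a hypothetical base offset of 12, skip 4 bytes and keep
    // 12 of the 16-byte scratch buffer.
    assert(alignedPadding(12L) == 4L)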


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2016-12-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/16386
  
Only regarding the comment 
https://github.com/apache/spark/pull/16386#issuecomment-269386229, I have a 
similar (rather combined) idea: provide another option, such as the corrupt 
file name (the column would perhaps appear only when the user explicitly sets 
it, for backwards compatibility), do not add a column via 
`columnNameOfCorruptRecord` in `wholeFile` mode (with proper documentation), 
and issue a warning message if `columnNameOfCorruptRecord` is set by the user 
in `wholeFile` mode. This is a rather complicated idea that might confuse 
users, though; I am not sure it is the best one.
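
(For illustration only, a hedged sketch of the reader-side options under 
discussion, assuming an existing SparkSession named `spark`; `wholeFile` is 
the option name proposed in this PR, the path is hypothetical, and the 
interaction with the corrupt-record column is exactly the open question here.)

    // Hedged sketch: combining the proposed wholeFile option with an explicit
    // corrupt-record column name.
    val df = spark.read
      .option("wholeFile", "true")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/path/to/multiline.json")                   // hypothetical input path
    df.printSchema()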


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16403: [SPARK-18819][CORE] Double byte alignment on ARM ...

2016-12-27 Thread michaelkamprath
Github user michaelkamprath commented on a diff in the pull request:

https://github.com/apache/spark/pull/16403#discussion_r93982964
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java 
---
@@ -119,7 +127,12 @@ public static void putFloat(Object object, long 
offset, float value) {
   }
 
   public static double getDouble(Object object, long offset) {
-return _UNSAFE.getDouble(object, offset);
+if ( null == _doubleBuffer) {
+  return _UNSAFE.getDouble(object, offset);
+} else {
+  copyMemory(object, offset, _doubleBuffer.getBaseObject(), 
_doubleBuffer.getBaseOffset(), 8);
+  return _UNSAFE.getDouble(_doubleBuffer.getBaseObject(), 
_doubleBuffer.getBaseOffset());
--- End diff --

I will test this idea and report back. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16403: [SPARK-18819][CORE] Double byte alignment on ARM ...

2016-12-27 Thread michaelkamprath
Github user michaelkamprath commented on a diff in the pull request:

https://github.com/apache/spark/pull/16403#discussion_r93982937
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java 
---
@@ -244,6 +257,34 @@ public static void throwException(Throwable t) {
   LONG_ARRAY_OFFSET = _UNSAFE.arrayBaseOffset(long[].class);
   FLOAT_ARRAY_OFFSET = _UNSAFE.arrayBaseOffset(float[].class);
   DOUBLE_ARRAY_OFFSET = _UNSAFE.arrayBaseOffset(double[].class);
+
+  // determine whether double access should be aligned.
+  String arch = System.getProperty("os.arch", "");
+  if (arch.matches("^(arm|arm32)")) {
+logger.info(
+"Host platform '{}' requires aligned double access. "+
+"Creating an aligned buffer for unsafe double reads.",
+arch);
+
+// allocate a 2x memory block to ensure buffer used is 8-byte 
aligned. Java
+// objects are always aligned, so we just need to ensure the 
offset is aligned
+// to an 8-byte boundary
+byte[] heapObj = new byte[16];
+long offset = BYTE_ARRAY_OFFSET;
+long bufferSize = 16;
+for (long i = 0; i < 8; ++i ) {
--- End diff --

Simply to match the `offset` variable, which needs to be a long.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16403: [SPARK-18819][CORE] Double byte alignment on ARM ...

2016-12-27 Thread michaelkamprath
Github user michaelkamprath commented on a diff in the pull request:

https://github.com/apache/spark/pull/16403#discussion_r93982849
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java 
---
@@ -44,6 +49,9 @@
   public static final int DOUBLE_ARRAY_OFFSET;
 
   private static final boolean unaligned;
+
+  private static final MemoryBlock _doubleBuffer;
--- End diff --

Admittedly, I am not deeply familiar with how this class is used concurrently 
within a single JVM. If multiple threads use it concurrently, you are right 
and a different approach must be found. I was trying to avoid the cost of 
reallocating a buffer with every call to `getDouble()`.
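
(For illustration, a hedged sketch of one alternative that keeps the buffer 
cheap while staying safe under concurrency; it is a suggestion, not the 
patch's approach.)

    // Hedged sketch: a per-thread aligned scratch buffer avoids both reallocating on
    // every getDouble() call and sharing a single buffer across threads.
    object PerThreadScratch {
      private val scratch = new ThreadLocal[Array[Byte]] {
        override def initialValue(): Array[Byte] = new Array[Byte](16)  // 2x size to find an aligned offset
      }
      def buffer(): Array[Byte] = scratch.get()
    }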


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13323: [SPARK-15555] [Mesos] Driver with --supervise option can...

2016-12-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13323
  
**[Test build #70647 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70647/testReport)**
 for PR 13323 at commit 
[`e632848`](https://github.com/apache/spark/commit/e632848697477f4ae102ccb5ee60431fe3c61440).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93974759
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => 
BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[classification] trait LinearSVCParams extends ClassifierParams 
with HasRegParam
+  with HasMaxIter with HasFitIntercept with HasTol with HasStandardization 
with HasWeightCol
+  with HasThreshold with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+  setDefault(fitIntercept -> true)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy at the cost of more 
iterations.
+   * Default is 1E-6.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+  setDefault(tol -> 1E-6)
+
+  /**
+   * Whether to standardize the training features before fitting the model.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setStandardization(value: Boolean): this.type = set(standardization, value)
+  setDefault(standardization -> true)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  /**
+   * Set threshold in binary classification, in range [0, 1].
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setThreshold(value: Double): this.type = set(threshold, value)
+  setDefault(threshold -> 
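
A minimal usage sketch of the LinearSVC estimator as it appears in the quoted
diff. This is an illustration under assumptions, not part of the PR: the
SparkSession setup, the libsvm input path, the column names, and the parameter
values are invented for the example, and only the setter names come from the
quoted code. The class Scaladoc above describes a hinge-loss objective
optimized with OWL-QN; in the usual formulation that objective is
regParam * R(w) + (1/n) * sum_i max(0, 1 - y_i * (w . x_i + b)), though the
exact scaling is not visible in the truncated diff.

    // Hypothetical usage sketch; see the assumptions listed above.
    import org.apache.spark.ml.classification.LinearSVC
    import org.apache.spark.sql.SparkSession

    object LinearSVCUsageSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("LinearSVCUsageSketch").getOrCreate()

        // Any DataFrame with "label" and "features" columns works here.
        val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

        val svc = new LinearSVC()
          .setRegParam(0.1)        // regularization strength (diff default is 0.0)
          .setMaxIter(50)          // diff default is 100
          .setTol(1e-6)            // convergence tolerance, matches the diff default
          .setFitIntercept(true)
          .setStandardization(true)

        val model = svc.fit(training)
        model.transform(training).select("label", "prediction").show(5)

        spark.stop()
      }
    }

fit and transform in the sketch come from the existing Classifier and
ClassificationModel base classes; the setters shown are the ones defined in the
quoted diff.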

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93974324
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93981206
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93982097
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93980975
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93974665
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93973973
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93972115
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93972699
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93974634
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93973211
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93972268
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93973041
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93982121
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam
+  with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol
+  with HasThreshold with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+  setDefault(fitIntercept -> true)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy at the cost of more iterations.
+   * Default is 1E-6.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+  setDefault(tol -> 1E-6)
+
+  /**
+   * whether to standardize the training features before fitting the model.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setStandardization(value: Boolean): this.type = set(standardization, value)
+  setDefault(standardization -> true)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  /**
+   * Set threshold in binary classification, in range [0, 1].
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setThreshold(value: Double): this.type = set(threshold, value)
+  setDefault(threshold -> 
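
For readers following the review above: a minimal sketch of how the setters quoted in this diff combine under the standard spark.ml Estimator API. It is not part of the reviewed patch; the SparkSession, the libsvm input path, and the default "label"/"features" column names are assumptions for illustration only.

import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.sql.SparkSession

object LinearSVCSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LinearSVCSketch").getOrCreate()

    // Assumed input: any DataFrame with "label" and "features" columns works;
    // the libsvm path below is a placeholder.
    val training = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

    val svc = new LinearSVC()
      .setRegParam(0.1)          // regularization strength (param default 0.0)
      .setMaxIter(50)            // cap on OWLQN iterations (param default 100)
      .setTol(1e-6)              // convergence tolerance (param default 1E-6)
      .setFitIntercept(true)     // fit an intercept term (param default true)
      .setStandardization(true)  // standardize features before fitting (param default true)

    val model = svc.fit(training)  // produces a LinearSVCModel
    model.transform(training).select("label", "prediction").show(5)

    spark.stop()
  }
}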

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93982401
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam
+  with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol
+  with HasThreshold with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+  setDefault(fitIntercept -> true)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy at the cost of more iterations.
+   * Default is 1E-6.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+  setDefault(tol -> 1E-6)
+
+  /**
+   * whether to standardize the training features before fitting the model.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setStandardization(value: Boolean): this.type = set(standardization, value)
+  setDefault(standardization -> true)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  /**
+   * Set threshold in binary classification, in range [0, 1].
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setThreshold(value: Double): this.type = set(threshold, value)
+  setDefault(threshold -> 
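
The "Hinge Loss" named in the class doc above is, in its usual L2-regularized soft-margin form, the objective sketched below; this is the textbook formulation, and the exact scaling of the regularization term in the patch is defined by the implementation rather than by this archive:

\min_{w,b}\ \frac{\lambda}{2}\,\lVert w \rVert_2^2 \;+\; \frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i\,(w^\top x_i + b)\bigr), \qquad y_i \in \{-1, +1\},\ \lambda \text{ controlled by } \texttt{regParam}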

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93979950
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam
+  with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol
+  with HasThreshold with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+  setDefault(fitIntercept -> true)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy at the cost of more iterations.
+   * Default is 1E-6.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+  setDefault(tol -> 1E-6)
+
+  /**
+   * whether to standardize the training features before fitting the model.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setStandardization(value: Boolean): this.type = set(standardization, value)
+  setDefault(standardization -> true)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  /**
+   * Set threshold in binary classification, in range [0, 1].
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setThreshold(value: Double): this.type = set(threshold, value)
+  setDefault(threshold -> 

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93975729
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam
+  with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol
+  with HasThreshold with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+  setDefault(fitIntercept -> true)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy at the cost of more iterations.
+   * Default is 1E-6.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+  setDefault(tol -> 1E-6)
+
+  /**
+   * whether to standardize the training features before fitting the model.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setStandardization(value: Boolean): this.type = set(standardization, value)
+  setDefault(standardization -> true)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  /**
+   * Set threshold in binary classification, in range [0, 1].
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setThreshold(value: Double): this.type = set(threshold, value)
+  setDefault(threshold -> 

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93975041
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam
+  with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol
+  with HasThreshold with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+  setDefault(fitIntercept -> true)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy at the cost of more iterations.
+   * Default is 1E-6.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+  setDefault(tol -> 1E-6)
+
+  /**
+   * whether to standardize the training features before fitting the model.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setStandardization(value: Boolean): this.type = set(standardization, value)
+  setDefault(standardization -> true)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  /**
+   * Set threshold in binary classification, in range [0, 1].
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setThreshold(value: Double): this.type = set(threshold, value)
+  setDefault(threshold -> 

[GitHub] spark pull request #15211: [SPARK-14709][ML] spark.ml API for linear SVM

2016-12-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15211#discussion_r93981539
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -0,0 +1,558 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.classification
+
+import scala.collection.mutable
+
+import breeze.linalg.{DenseVector => BDV}
+import breeze.optimize.{CachedDiffFunction, DiffFunction, OWLQN => BreezeOWLQN}
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.internal.Logging
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.linalg._
+import org.apache.spark.ml.linalg.BLAS._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util._
+import org.apache.spark.mllib.linalg.VectorImplicits._
+import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}
+import org.apache.spark.sql.functions.{col, lit}
+
+/** Params for linear SVM Classifier. */
+private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam
+  with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol
+  with HasThreshold with HasAggregationDepth {
+
+}
+
+/**
+ * :: Experimental ::
+ * Linear SVM Classifier with Hinge Loss and OWLQN optimizer
+ */
+@Since("2.2.0")
+@Experimental
+class LinearSVC @Since("2.2.0") (
+@Since("2.2.0") override val uid: String)
+  extends Classifier[Vector, LinearSVC, LinearSVCModel]
+  with LinearSVCParams with DefaultParamsWritable {
+
+  @Since("2.2.0")
+  def this() = this(Identifiable.randomUID("linearsvc"))
+
+  /**
+   * Set the regularization parameter.
+   * Default is 0.0.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setRegParam(value: Double): this.type = set(regParam, value)
+  setDefault(regParam -> 0.0)
+
+  /**
+   * Set the maximum number of iterations.
+   * Default is 100.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setMaxIter(value: Int): this.type = set(maxIter, value)
+  setDefault(maxIter -> 100)
+
+  /**
+   * Whether to fit an intercept term.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
+  setDefault(fitIntercept -> true)
+
+  /**
+   * Set the convergence tolerance of iterations.
+   * Smaller value will lead to higher accuracy at the cost of more iterations.
+   * Default is 1E-6.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setTol(value: Double): this.type = set(tol, value)
+  setDefault(tol -> 1E-6)
+
+  /**
+   * whether to standardize the training features before fitting the model.
+   * Default is true.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setStandardization(value: Boolean): this.type = set(standardization, value)
+  setDefault(standardization -> true)
+
+  /**
+   * Sets the value of param [[weightCol]].
+   * If this is not set or empty, we treat all instance weights as 1.0.
+   * Default is not set, so all instances have weight one.
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
+  /**
+   * Set threshold in binary classification, in range [0, 1].
+   *
+   * @group setParam
+   */
+  @Since("2.2.0")
+  def setThreshold(value: Double): this.type = set(threshold, value)
+  setDefault(threshold -> 
