[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r104358658
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -21,6 +21,7 @@ import scala.collection.JavaConverters._
 
 import org.antlr.v4.runtime.{ParserRuleContext, Token}
 import org.antlr.v4.runtime.tree.TerminalNode
+import org.apache.hadoop.fs.Path
--- End diff --

we can remove this


[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r104358620
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala
 ---
@@ -883,7 +885,7 @@ abstract class CatalogTestUtils {
 
   def newFunc(): CatalogFunction = newFunc("funcName")
 
-  def newUriForDatabase(): String = 
Utils.createTempDir().toURI.toString.stripSuffix("/")
+  def newUriForDatabase(): URI = new 
URI(Utils.createTempDir().toURI.toString.stripSuffix("/"))
--- End diff --

I think it's ok if we always compare URI with Path, instead of converting 
it to string.


[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16867
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73968/
Test PASSed.


[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16867
  
Merged build finished. Test PASSed.


[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r104358453
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala
 ---
@@ -553,7 +555,7 @@ abstract class ExternalCatalogSuite extends 
SparkFunSuite with BeforeAndAfterEac
   test("alter partitions") {
 val catalog = newBasicCatalog()
 try {
-  val newLocation = newUriForDatabase()
+  val newLocation = new Path(newUriForDatabase()).toUri
--- End diff --

unnecessary change?


[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104358500
  
--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -168,6 +168,27 @@ private[spark] object TestUtils {
 createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }
 
+  /** Create a dummy compile jar for a given package, classname.  Jar will 
be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: 
String): String = {
+val srcDir = new File(destDir, packageName)
+srcDir.mkdirs()
+val excSource = new JavaSourceFromString(new File(srcDir, 
className).getAbsolutePath,
--- End diff --

Can we use the URI form here? It seems this one is problematic.


[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16867
  
**[Test build #73968 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73968/testReport)**
 for PR 16867 at commit 
[`f30ef46`](https://github.com/apache/spark/commit/f30ef46f7e64416e7b958dffca77fdf6b6d1dfba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-03-05 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r104358279
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogSuite.scala
 ---
@@ -883,7 +885,7 @@ abstract class CatalogTestUtils {
 
   def newFunc(): CatalogFunction = newFunc("funcName")
 
-  def newUriForDatabase(): String = 
Utils.createTempDir().toURI.toString.stripSuffix("/")
+  def newUriForDatabase(): URI = new 
URI(Utils.createTempDir().toURI.toString.stripSuffix("/"))
--- End diff --

@cloud-fan after trying to use Path to compare, I think stripSuffix here is the simpler way.

Two Paths built with the `String` constructor compare equal even when one has a trailing `/` and the other does not.
```
scala> val x = new Path("/ab/c/")
x: org.apache.hadoop.fs.Path = /ab/c

scala> val y = new Path("/ab/c")
y: org.apache.hadoop.fs.Path = /ab/c

scala> x == y
res0: Boolean = true
```

Two Paths built with the `URI` constructor do not compare equal when one has a trailing `/` and the other does not.
```
scala> val x =new URI("/a/b/c/")
x: java.net.URI = /a/b/c/

scala> val y =new URI("/a/b/c")
y: java.net.URI = /a/b/c

scala> x == y
res1: Boolean = false

scala> val x1 =new Path(x)
x1: org.apache.hadoop.fs.Path = /a/b/c/

scala> val y1 =new Path(y)
y1: org.apache.hadoop.fs.Path = /a/b/c

scala> x1 == y1
res2: Boolean = false
```

So only a Path built from a `String` ignores the trailing `/`. If we have a `URI` in hand and want to compare it with another `URI`, we would first have to convert both to `String` and construct Paths from those strings; only after those two steps does the comparison ignore the trailing `/`.

But I think that is more complicated. Here we normalize the URI with stripSuffix at the source and then compare the two URIs directly, which is simpler.

Do we really have to convert to Path to compare?



[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r104358192
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -131,7 +132,7 @@ class SessionCatalog(
* does not contain a scheme, this path will not be changed after the 
default
* FileSystem is changed.
*/
-  private def makeQualifiedPath(path: String): Path = {
+  private def makeQualifiedPath(path: URI): Path = {
--- End diff --

this method can return `URI` too
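
A minimal sketch of that suggestion (illustrative only; `hadoopConf` is passed explicitly here just to keep the snippet self-contained, and the actual signature in the PR may differ):

```
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Qualify the location against the default FileSystem and hand back a URI,
// so callers never have to deal with hadoop Path directly.
def makeQualifiedPath(path: URI, hadoopConf: Configuration): URI = {
  val hadoopPath = new Path(path)
  val fs = hadoopPath.getFileSystem(hadoopConf)
  fs.makeQualified(hadoopPath).toUri
}
```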


[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104357841
  
--- Diff: R/pkg/R/context.R ---
@@ -319,6 +319,34 @@ spark.addFile <- function(path, recursive = FALSE) {
   invisible(callJMethod(sc, "addFile", 
suppressWarnings(normalizePath(path)), recursive))
 }
 
+
+#' Adds a JAR dependency for all tasks to be executed on this SparkContext 
in the future.
+#'
+#' The \code{path} passed can be either a local file, a file in HDFS (or 
other Hadoop-supported
+#' filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on 
every worker node.
+#' If \code{addToCurrentClassLoader} is true, add the jar to the current 
threads' classloader. In
+#' general adding to the current threads' class loader will impact all 
other application threads
+#' unless they have explicitly changed their class loader.
+#'
+#' @rdname spark.addJar
+#' @param path The path of the jar to be added
+#' @param addToCurrentClassLoader Whether to add the jar to the current 
driver classloader.
+#' Default is FALSE.
+#' @export
+#' @examples
+#'\dontrun{
+#' spark.addJar("/path/to/something.jar", TRUE)
+#'}
+#' @note spark.addJar since 2.2.0
+spark.addJar <- function(path, addToCurrentClassLoader = FALSE) {
+  sc <- getSparkContext()
+  normalizedPath <- suppressWarnings(normalizePath(path))
+  scala_sc <- callJMethod(sc, "sc")
+  invisible(callJMethod(scala_sc, "addJar", normalizedPath, 
addToCurrentClassLoader))
--- End diff --

This seems to fail as below:

```
1. Error: add jar should work and allow usage of the jar on the driver node 
(@test_context.R#174) 
java.lang.IllegalArgumentException: Illegal character in path at index 12: 
string:///C:\Users\appveyor\AppData\Local\Temp\1\RtmpCqEUJL\testjar\sparkrTests\DummyClassForAddJarTest.java
at java.net.URI.create(URI.java:852)
at 
org.apache.spark.TestUtils$.org$apache$spark$TestUtils$$createURI(TestUtils.scala:119)
at 
org.apache.spark.TestUtils$JavaSourceFromString.(TestUtils.scala:123)
```
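
A small sketch of why the raw absolute path breaks here (illustrative only, with a hypothetical Windows-style path; not the actual fix): on Windows, `getAbsolutePath` keeps backslashes, which `java.net.URI.create` rejects, while `File.toURI` yields a well-formed `file:` URI.

```
import java.io.File

val raw = "C:\\Users\\appveyor\\Dummy.java"   // hypothetical Windows path

// Gluing the raw path into a URI string keeps the backslashes, and
// java.net.URI.create("string:///" + raw) fails with "Illegal character in path".

// File.toURI uses forward slashes and percent-encoding, so it is safe to feed to URI:
val asUri = new File(raw).toURI.toString      // e.g. file:/C:/Users/appveyor/Dummy.java on Windows
```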


[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r104358056
  
--- Diff: core/src/main/scala/org/apache/spark/TestUtils.scala ---
@@ -168,6 +168,27 @@ private[spark] object TestUtils {
 createCompiledClass(className, destDir, sourceFile, classpathUrls)
   }
 
+  /** Create a dummy compile jar for a given package, classname.  Jar will 
be placed in destDir */
+  def createDummyJar(destDir: String, packageName: String, className: 
String): String = {
+val srcDir = new File(destDir, packageName)
+srcDir.mkdirs()
+val excSource = new JavaSourceFromString(new File(srcDir, 
className).getAbsolutePath,
+  s"""package $packageName;
+ |
+  |public class $className implements java.io.Serializable {
+ |  public static String helloWorld(String arg) { return "Hello " 
+ arg; }
+ |  public static int addStuff(int arg1, int arg2) { return arg1 + 
arg2; }
+ |}
+""".
+stripMargin)
+val excFile = createCompiledClass(className, srcDir, excSource, 
Seq.empty)
--- End diff --

Can we use the URI form here? It seems this one is problematic.


[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r104357644
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala
 ---
@@ -162,6 +164,28 @@ object CatalogUtils {
 BucketSpec(numBuckets, normalizedBucketCols, normalizedSortCols)
   }
 
+  /**
+   * Convert URI to String.
+   * Since URI.toString does not decode for the uri string, we need to use
+   * Path(uri).toString to decode it.
+   * @param uri the URI of the path
+   * @return the String of the path
+   */
+  def URIToString(uri: Option[URI]): Option[String] = {
+uri.map(new Path(_).toString)
--- End diff --

Can you try `java.net.URLDecoder`? I'd like to avoid a hadoop dependency for this kind of basic functionality.
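
A minimal sketch of what that could look like (just an illustration of the suggestion, not the final implementation; note that `URLDecoder` also turns `+` into a space, which plain URI percent-decoding does not, so that edge case would need checking):

```
import java.net.{URI, URLDecoder}

// Decode the percent-encoded characters in the URI's string form without
// depending on hadoop Path.
def URIToString(uri: Option[URI]): Option[String] =
  uri.map(u => URLDecoder.decode(u.toString, "UTF-8"))
```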


[GitHub] spark issue #17149: [SPARK-19257][SQL]location for table/partition/database ...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17149
  
**[Test build #73979 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73979/testReport)**
 for PR 17149 at commit 
[`b6bc466`](https://github.com/apache/spark/commit/b6bc4668f4b9b4cd0161ec58e9dcd54527bed0b7).


[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104357269
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, 
inputPaths, parsedOptions)
-val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+csv.flatMap { lines =>
   UnivocityParser.tokenizeStream(
 
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-false,
+shouldDropHeader = false,
 new CsvParser(parsedOptions.asParserSettings))
-}.take(1).headOption
-
-if (maybeFirstRow.isDefined) {
-  val firstRow = maybeFirstRow.get
-  val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-  val tokenRDD = csv.flatMap { lines =>
-UnivocityParser.tokenizeStream(
-  
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-  parsedOptions.headerFlag,
-  new CsvParser(parsedOptions.asParserSettings))
-  }
-  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-} else {
-  // If the first row could not be read, just return the empty schema.
-  Some(StructType(Nil))
+}.take(1).headOption match {
--- End diff --

https://github.com/apache/spark/pull/17068#discussion_r104356866 did not show up while I was writing my comment. I am fine with it as is; I am not the one to decide this.


[GitHub] spark pull request #17149: [SPARK-19257][SQL]location for table/partition/da...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17149#discussion_r104357082
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala
 ---
@@ -162,6 +164,28 @@ object CatalogUtils {
 BucketSpec(numBuckets, normalizedBucketCols, normalizedSortCols)
   }
 
+  /**
+   * Convert URI to String.
+   * Since URI.toString does not decode for the uri string, we need to use
--- End diff --

`Since URI.toString does not decode the uri string, here we create a hadoop Path with the given URI, and rely on hadoop Path.toString to decode the uri`
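
A tiny illustration of the behaviour this wording describes (sketch only):

```
import java.net.URI
import org.apache.hadoop.fs.Path

val u = new URI("file:/tmp/a%20b")
u.toString            // file:/tmp/a%20b  -- URI.toString keeps the percent-encoding
new Path(u).toString  // file:/tmp/a b    -- hadoop Path.toString decodes the path component
```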


[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16867
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16867
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73967/
Test PASSed.


[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16867
  
**[Test build #73967 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73967/testReport)**
 for PR 16867 at commit 
[`712e06e`](https://github.com/apache/spark/commit/712e06e4f851481700f0cebc194891fa57102458).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104356921
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, 
inputPaths, parsedOptions)
-val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+csv.flatMap { lines =>
   UnivocityParser.tokenizeStream(
 
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-false,
+shouldDropHeader = false,
 new CsvParser(parsedOptions.asParserSettings))
-}.take(1).headOption
-
-if (maybeFirstRow.isDefined) {
-  val firstRow = maybeFirstRow.get
-  val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-  val tokenRDD = csv.flatMap { lines =>
-UnivocityParser.tokenizeStream(
-  
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-  parsedOptions.headerFlag,
-  new CsvParser(parsedOptions.asParserSettings))
-  }
-  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-} else {
-  // If the first row could not be read, just return the empty schema.
-  Some(StructType(Nil))
+}.take(1).headOption match {
--- End diff --

All three patterns I mentioned are used across the code base. There is no style guide for this in either https://github.com/databricks/scala-style-guide or http://spark.apache.org/contributing.html.

In this case, matching the new code to other similar code in order to reduce the changed lines is the better choice, rather than doing the opposite. Personal taste is secondary.


[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104356866
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, 
inputPaths, parsedOptions)
-val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+csv.flatMap { lines =>
   UnivocityParser.tokenizeStream(
 
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-false,
+shouldDropHeader = false,
 new CsvParser(parsedOptions.asParserSettings))
-}.take(1).headOption
-
-if (maybeFirstRow.isDefined) {
-  val firstRow = maybeFirstRow.get
-  val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-  val tokenRDD = csv.flatMap { lines =>
-UnivocityParser.tokenizeStream(
-  
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-  parsedOptions.headerFlag,
-  new CsvParser(parsedOptions.asParserSettings))
-  }
-  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-} else {
-  // If the first row could not be read, just return the empty schema.
-  Some(StructType(Nil))
+}.take(1).headOption match {
--- End diff --

I don't have a strong preference, this looks fine


[GitHub] spark issue #17174: [SPARK-19145][SQL] Timestamp to String casting is slowin...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17174
  
Can one of the admins verify this patch?


[GitHub] spark pull request #17174: [SPARK-19145][SQL] Timestamp to String casting is...

2017-03-05 Thread tanejagagan
GitHub user tanejagagan opened a pull request:

https://github.com/apache/spark/pull/17174

[SPARK-19145][SQL] Timestamp to String casting is slowing the query significantly

If a BinaryComparison compares a timestamp expression with a string expression, cast the string to timestamp when the string-typed expression is foldable. This results in an order-of-magnitude performance improvement in query execution.

## What changes were proposed in this pull request?
If a BinaryComparison compares a timestamp expression with a string expression, cast the string to timestamp when the string-typed expression is foldable. This results in an order-of-magnitude performance improvement in query execution.
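
A minimal illustration of the kind of predicate this targets (hypothetical table and column names, sketch only):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()

// A tiny table with a timestamp column, just to have something to query.
spark.range(0, 1000)
  .selectExpr("cast(date_add(to_date('2017-01-01'), cast(id % 365 as int)) as timestamp) AS ts")
  .createOrReplaceTempView("events")

// The string literal on the right-hand side is foldable, so it can be cast to
// timestamp once, instead of casting `ts` to string for every row.
spark.sql("SELECT count(*) FROM events WHERE ts > '2017-03-06 00:00:00'").show()
```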

## How was this patch tested?

Added new unit tests to cover the functionality.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tanejagagan/spark branch-19145

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17174.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17174


commit 746b8926747ce85d4d5ecd78a9291a13b15e52d5
Author: gagan taneja 
Date:   2017-03-06T07:26:39Z

[SPARK-19145][SQL] Timestamp to String casting is slowing the query 
significantly

If BinaryComparison has expression with timestamp and string datatype then 
cast string to timestamp if string type expression is foldable. This results in 
order of magnitude performance improvement in query execution

Added new unit tests to cover the functionality




[GitHub] spark pull request #17136: [SPARK-19783][SQL] Treat longer lengths of tokens...

2017-03-05 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17136#discussion_r104356333
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -246,8 +246,8 @@ test_that("read/write csv as DataFrame", {
   mockLinesCsv <- c("year,make,model,comment,blank",
"\"2012\",\"Tesla\",\"S\",\"No comment\",",
"1997,Ford,E350,\"Go get one now they are going 
fast\",",
-   "2015,Chevy,Volt",
-   "NA,Dummy,Placeholder")
+   "2015,Chevy,Volt,,",
--- End diff --

I think that dropping the extra tokens in the **longer** case is incorrect behaviour, judging by the JSON behaviour. But I know this change could affect current users, so we might need to do something about that, e.g., add a new option to keep the current behaviour. WDYT? cc: @HyukjinKwon


[GitHub] spark pull request #17171: [SPARK-19830] [SQL] Add parseTableSchema API to P...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17171#discussion_r104356142
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/TableSchemaParserSuite.scala
 ---
@@ -0,0 +1,85 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.catalyst.parser
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.types._
+
+class TableSchemaParserSuite extends SparkFunSuite {
+
+  def parse(sql: String): DataType = 
CatalystSqlParser.parseTableSchema(sql)
+
+  def checkTableSchema(tableSchemaString: String, expectedDataType: 
DataType): Unit = {
+test(s"parse ${tableSchemaString.replace("\n", "")}") {
+  assert(parse(tableSchemaString) === expectedDataType)
+}
+  }
+
+  checkTableSchema("a int", (new StructType).add("a", "int"))
--- End diff --

nit: `new StructType().add...`


[GitHub] spark issue #17149: [SPARK-19257][SQL]location for table/partition/database ...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17149
  
**[Test build #73977 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73977/testReport)**
 for PR 17149 at commit 
[`dc0a37b`](https://github.com/apache/spark/commit/dc0a37b8860ff41fbb7511522d8ceff43965ddd8).


[GitHub] spark issue #17125: [SPARK-19211][SQL] Explicitly prevent Insert into View o...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17125
  
**[Test build #73978 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73978/testReport)**
 for PR 17125 at commit 
[`8d4be05`](https://github.com/apache/spark/commit/8d4be059b458330150ca31e07e2e76a82e5159a4).


[GitHub] spark issue #15666: [SPARK-11421] [Core][Python][R] Added ability for addJar...

2017-03-05 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/15666
  
So the R AppVeyor tests failed; since these are only triggered on R changes, maybe @felixcheung has some ideas?


[GitHub] spark pull request #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule ...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16933#discussion_r104355455
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -562,27 +562,43 @@ object CollapseProject extends Rule[LogicalPlan] {
 }
 
 /**
- * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] 
operator combinations
- * by keeping only the one.
- * 1. For adjacent [[Repartition]]s, collapse into the last 
[[Repartition]].
- * 2. For adjacent [[RepartitionByExpression]]s, collapse into the last 
[[RepartitionByExpression]].
- * 3. For a combination of [[Repartition]] and 
[[RepartitionByExpression]], collapse as a single
- *[[RepartitionByExpression]] with the expression and last number of 
partition.
+ * Combines adjacent [[RepartitionOperation]] operators
  */
 object CollapseRepartition extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-// Case 1
-case Repartition(numPartitions, shuffle, Repartition(_, _, child)) =>
-  Repartition(numPartitions, shuffle, child)
-// Case 2
-case RepartitionByExpression(exprs, RepartitionByExpression(_, child, 
_), numPartitions) =>
-  RepartitionByExpression(exprs, child, numPartitions)
-// Case 3
-case Repartition(numPartitions, _, r: RepartitionByExpression) =>
-  r.copy(numPartitions = numPartitions)
-// Case 3
-case RepartitionByExpression(exprs, Repartition(_, _, child), 
numPartitions) =>
-  RepartitionByExpression(exprs, child, numPartitions)
+// Case 1: When a Repartition has a child of Repartition or 
RepartitionByExpression,
+// we can collapse it with the child based on the type of shuffle and 
the specified number
+// of partitions.
+case r @ Repartition(_, _, child: Repartition) =>
+  collapseRepartition(r, child)
+case r @ Repartition(_, _, child: RepartitionByExpression) =>
+  collapseRepartition(r, child)
+// Case 2: When a RepartitionByExpression has a child of Repartition 
or RepartitionByExpression
+// we can remove the child.
+case r @ RepartitionByExpression(_, child: RepartitionByExpression, _) 
=>
+  r.copy(child = child.child)
+case r @ RepartitionByExpression(_, child: Repartition, _) =>
+  r.copy(child = child.child)
+  }
+
+  /**
+   * Collapses the [[Repartition]] with its child 
[[RepartitionOperation]], if possible.
+   * - Case 1 the top [[Repartition]] does not enable shuffle (i.e., 
coalesce API):
+   *   If the last numPartitions is bigger, returns the child node; 
otherwise, keep unchanged.
+   * - Case 2 the top [[Repartition]] enables shuffle (i.e., repartition 
API):
+   *   returns the child node with the last numPartitions.
+   */
+  private def collapseRepartition(r: Repartition, child: 
RepartitionOperation): LogicalPlan = {
+(r.shuffle, child.shuffle) match {
+  case (false, true) => child match {
--- End diff --

why this pattern match? we can just call `child.numPartitions`
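
A sketch of the simplification being suggested (assuming `RepartitionOperation` exposes `shuffle` and `numPartitions`, so no per-subclass match is needed; the exact rule in the PR may differ):

```
def collapseRepartition(r: Repartition, child: RepartitionOperation): LogicalPlan = {
  (r.shuffle, child.shuffle) match {
    // coalesce on top of a shuffling repartition: if the coalesce asks for at least
    // as many partitions as the child already produces, the coalesce is a no-op,
    // so return the child; otherwise keep both nodes.
    case (false, true) =>
      if (r.numPartitions >= child.numPartitions) child else r
    // in the other cases the top node wins, so just splice out the child
    case _ => r.copy(child = child.child)
  }
}
```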


[GitHub] spark issue #17149: [SPARK-19257][SQL]location for table/partition/database ...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17149
  
**[Test build #73976 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73976/testReport)**
 for PR 17149 at commit 
[`109e2b5`](https://github.com/apache/spark/commit/109e2b5641dc91157301563134e17670be83341a).


[GitHub] spark pull request #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule ...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16933#discussion_r104355337
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -562,27 +562,43 @@ object CollapseProject extends Rule[LogicalPlan] {
 }
 
 /**
- * Combines adjacent [[Repartition]] and [[RepartitionByExpression]] 
operator combinations
- * by keeping only the one.
- * 1. For adjacent [[Repartition]]s, collapse into the last 
[[Repartition]].
- * 2. For adjacent [[RepartitionByExpression]]s, collapse into the last 
[[RepartitionByExpression]].
- * 3. For a combination of [[Repartition]] and 
[[RepartitionByExpression]], collapse as a single
- *[[RepartitionByExpression]] with the expression and last number of 
partition.
+ * Combines adjacent [[RepartitionOperation]] operators
  */
 object CollapseRepartition extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-// Case 1
-case Repartition(numPartitions, shuffle, Repartition(_, _, child)) =>
-  Repartition(numPartitions, shuffle, child)
-// Case 2
-case RepartitionByExpression(exprs, RepartitionByExpression(_, child, 
_), numPartitions) =>
-  RepartitionByExpression(exprs, child, numPartitions)
-// Case 3
-case Repartition(numPartitions, _, r: RepartitionByExpression) =>
-  r.copy(numPartitions = numPartitions)
-// Case 3
-case RepartitionByExpression(exprs, Repartition(_, _, child), 
numPartitions) =>
-  RepartitionByExpression(exprs, child, numPartitions)
+// Case 1: When a Repartition has a child of Repartition or 
RepartitionByExpression,
+// we can collapse it with the child based on the type of shuffle and 
the specified number
+// of partitions.
+case r @ Repartition(_, _, child: Repartition) =>
--- End diff --

we can write just one case: `case r @ Repartition(_, _, child: RepartitionOperation)`


[GitHub] spark pull request #17075: [SPARK-19727][SQL] Fix for round function that mo...

2017-03-05 Thread wojtek-szymanski
Github user wojtek-szymanski commented on a diff in the pull request:

https://github.com/apache/spark/pull/17075#discussion_r104354415
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala ---
@@ -223,6 +224,19 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
   }
 
   /**
+   * Create new `Decimal` with given precision and scale.
+   *
+   * @return `Some(decimal)` if successful or `None` if overflow would 
occur
+   */
+  private[sql] def toPrecision(
+   precision: Int,
--- End diff --

Actually I did, but I saw so many different styles and I had no idea which 
one is correct. Thanks again for your patience


[GitHub] spark issue #16592: [SPARK-19235] [SQL] [TESTS] Enable Test Cases in DDLSuit...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16592
  
**[Test build #73975 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73975/testReport)**
 for PR 16592 at commit 
[`b186dcd`](https://github.com/apache/spark/commit/b186dcddb800a7e1e42b84fbf90da34a93bf7e84).


[GitHub] spark issue #17164: [SPARK-16844][SQL][WIP] Support codegen for sort-based a...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17164
  
**[Test build #73974 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73974/testReport)**
 for PR 17164 at commit 
[`d5cc0f0`](https://github.com/apache/spark/commit/d5cc0f08624448362e535f46b9deea889f2467ab).


[GitHub] spark issue #16290: [SPARK-18817] [SPARKR] [SQL] Set default warehouse dir t...

2017-03-05 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/16290
  
@gatorsmile @cloud-fan @felixcheung I looked at the SharedState code more 
closely and it looks like the only time the warehousePath can be set is when 
the initialization of shared state happens. So I modified the code to set the 
temp dir for SparkR during SparkSession initialization if it has not already 
been set. 

I verified that this works locally and with a unit test. Is there anything I might be missing with this approach?


[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread wojtek-szymanski
Github user wojtek-szymanski commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104353760
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, 
inputPaths, parsedOptions)
-val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+csv.flatMap { lines =>
   UnivocityParser.tokenizeStream(
 
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-false,
+shouldDropHeader = false,
 new CsvParser(parsedOptions.asParserSettings))
-}.take(1).headOption
-
-if (maybeFirstRow.isDefined) {
-  val firstRow = maybeFirstRow.get
-  val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-  val tokenRDD = csv.flatMap { lines =>
-UnivocityParser.tokenizeStream(
-  
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-  parsedOptions.headerFlag,
-  new CsvParser(parsedOptions.asParserSettings))
-  }
-  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-} else {
-  // If the first row could not be read, just return the empty schema.
-  Some(StructType(Nil))
+}.take(1).headOption match {
--- End diff --

I would leave it as it is, since pattern matching still looks a bit clearer than conditionals. If minimizing changes is critical, I can revert to the previous version here and replace pattern matching with conditionals in my fix. @cloud-fan, please advise.


[GitHub] spark pull request #17075: [SPARK-19727][SQL] Fix for round function that mo...

2017-03-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17075#discussion_r104353273
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala ---
@@ -223,6 +224,19 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
   }
 
   /**
+   * Create new `Decimal` with given precision and scale.
+   *
+   * @return `Some(decimal)` if successful or `None` if overflow would 
occur
+   */
+  private[sql] def toPrecision(
+   precision: Int,
--- End diff --

Use 4-space indentation here. Please take a look at other methods in Spark and follow the code style.


[GitHub] spark issue #17173: [SPARK-19832][SQL]DynamicPartitionWriteTask get partitio...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17173
  
**[Test build #73972 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73972/testReport)**
 for PR 17173 at commit 
[`9800a7d`](https://github.com/apache/spark/commit/9800a7d23857ccf7f8c35f99225cece54f3d385a).


[GitHub] spark issue #16290: [SPARK-18817] [SPARKR] [SQL] Set default warehouse dir t...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16290
  
**[Test build #73973 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73973/testReport)**
 for PR 16290 at commit 
[`b14c302`](https://github.com/apache/spark/commit/b14c3028e325c7803353aafa81ba37706eba537f).


[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/16867
  
The cost of the median heap could be higher than TreeMap, IMO - for example, the additional dequeue + enqueue when a rebalance is required. If the cost is high enough, we might want to take another look at the PR. For example:
* if the overhead is not negligible, we would want to disable this when speculative execution is disabled;
* depending on how high the cost is, disable it when the number of tasks is below some threshold?

Essentially, the question is whether the benefit of the median computation outweighs the cost of maintaining the data structure. Looking at various task cardinalities will help arrive at a better judgement.

@kayousterhout, @squito any other comments on how to measure the impact of the PR?


[GitHub] spark pull request #17173: [SPARK-19832][SQL]DynamicPartitionWriteTask get p...

2017-03-05 Thread windpiger
GitHub user windpiger opened a pull request:

https://github.com/apache/spark/pull/17173

[SPARK-19832][SQL]DynamicPartitionWriteTask get partitionPath should escape 
the partition name

## What changes were proposed in this pull request?

Currently in DynamicPartitionWriteTask, when we compute the partitionPath of a partition, we only escape the partition value, not the partition name.

This causes problems for special partition names, for example:

1) If the partition name contains '%' etc., two partition paths are created in the filesystem: an escaped path like '/path/a%25b=1' and an unescaped path like '/path/a%b=1'. The inserted data is stored under the unescaped path, while SHOW PARTITIONS returns 'a%25b=1', i.e. the partition name escaped, so the two are inconsistent. I think the data should be stored under the escaped path in the filesystem, which is also what Hive 2.0.0 does.

2) If the partition name contains ':', new Path("/path", "a:b") throws an exception, because a colon in a relative path is illegal:

```
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
path in absolute URI: a:b
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.(Path.java:171)
  at org.apache.hadoop.fs.Path.(Path.java:88)
  ... 48 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 50 more
```
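
For illustration, here is a minimal, hedged sketch of the escaping involved. 
`escapePartitionKey` and `partitionDirName` are hypothetical helpers, not the 
actual Hive/Spark utilities, and the escaped character set is an assumption 
made only for the example:

```scala
// Sketch: percent-encode special characters in both the partition name and
// the partition value, so that a name such as "a:b" yields a legal path
// segment like "a%3Ab=1" instead of failing in new Path(parent, "a:b=1").
object PartitionPathEscaping {
  // Assumed set of characters to escape, for illustration only.
  private val charsToEscape: Set[Char] = "%:/#?\\=".toSet

  def escapePartitionKey(s: String): String =
    s.flatMap { c =>
      if (charsToEscape.contains(c)) f"%%${c.toInt}%02X" else c.toString
    }

  def partitionDirName(name: String, value: String): String =
    s"${escapePartitionKey(name)}=${escapePartitionKey(value)}"

  def main(args: Array[String]): Unit = {
    println(partitionDirName("a:b", "1"))  // a%3Ab=1
    println(partitionDirName("a%b", "1"))  // a%25b=1
  }
}
```
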
## How was this patch tested?
unit test added

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/windpiger/spark 
fixDatasourceSpecialCharPartitionName

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17173


commit deb63bd9f3586b7f9c2e203f8486d6a7eb49bc72
Author: windpiger 
Date:   2017-03-06T06:57:25Z

[SPARK-19832][SQL]DynamicPartitionWriteTask get partitionPath should escape 
the partition name




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16867: [SPARK-16929] Improve performance when check spec...

2017-03-05 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16867#discussion_r104352100
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/MedianHeap.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.collection.mutable
+
+/**
+ * MedianHeap stores numbers and returns the median by O(1) time 
complexity.
+ * The basic idea is to maintain two heaps: a maxHeap and a minHeap. The 
maxHeap stores
+ * the smaller half of all numbers while the minHeap stores the larger 
half.  The sizes
+ * of two heaps need to be balanced each time when a new number is 
inserted so that their
+ * sizes will not be different by more than 1. Therefore each time when 
findMedian() is
+ * called we check if two heaps have the same size. If they do, we should 
return the
+ * average of the two top values of heaps. Otherwise we return the top of 
the heap which
+ * has one more element.
+ */
+
+private[spark]
+class MedianHeap(implicit val ord: Ordering[Double]) {
+
+  // Stores all the numbers less than the current median in a maxHeap,
+  // i.e median is the maximum, at the root
+  val maxHeap = 
mutable.PriorityQueue.empty[Double](implicitly[Ordering[Double]])
+
+  // Stores all the numbers greater than the current median in a minHeap,
+  // i.e median is the minimum, at the root
+  val minHeap = 
mutable.PriorityQueue.empty[Double](implicitly[Ordering[Double]].reverse)
+
+  // Returns if there is no element in MedianHeap.
+  def isEmpty(): Boolean = {
+maxHeap.isEmpty && minHeap.isEmpty
+  }
+
+  // Size of MedianHeap.
+  def size(): Int = {
+maxHeap.size + minHeap.size
+  }
+
+  // Insert a new number into MedianHeap.
+  def insert(x: Double): Unit = {
+// If both heaps are empty, we arbitrarily insert it into a heap, 
let's say, the minHeap.
+if (isEmpty) {
+  minHeap.enqueue(x)
+} else {
+  // If the number is larger than current median, it should be 
inserted into minHeap,
+  // otherwise maxHeap.
+  if (x > findMedian) {
+minHeap.enqueue(x)
+  } else {
+maxHeap.enqueue(x)
+  }
+}
+rebalance()
+  }
+
+  // Re-balance the heaps.
+  private[this] def rebalance(): Unit = {
+if (minHeap.size - maxHeap.size > 1) {
+  maxHeap.enqueue(minHeap.dequeue())
+}
+if (maxHeap.size - minHeap.size > 1) {
+  minHeap.enqueue(maxHeap.dequeue)
+}
+  }
+
+  // Returns the median of the numbers.
+  def findMedian(): Double = {
+if (isEmpty) {
+  throw new NoSuchElementException("MedianHeap is empty.")
+}
+if (minHeap.size == maxHeap.size) {
+  (minHeap.head + maxHeap.head) / 2.0
--- End diff --

I prefer actual values, not synthetic ones.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16867: [SPARK-16929] Improve performance when check spec...

2017-03-05 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16867#discussion_r104352237
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -754,7 +743,6 @@ private[spark] class TaskSetManager(
 }
 removeRunningTask(tid)
 info.markFinished(state)
-successfulTaskIdsSet -= tid
--- End diff --

Not handling task failure will cause other issues: we need to keep the data 
structures consistent. Cost is secondary to correctness, even for a performance PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16867: [SPARK-16929] Improve performance when check spec...

2017-03-05 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16867#discussion_r104351990
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/MedianHeapSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.util
+import java.util.NoSuchElementException
+
+import scala.collection.mutable.ArrayBuffer
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+
+class MedianHeapSuite extends SparkFunSuite {
+
+  test("If no numbers in MedianHeap, NoSuchElementException is thrown.") {
+val medianHeap = new MedianHeap()
+var valid = false
+try {
+  medianHeap.findMedian()
+} catch {
+  case e: NoSuchElementException =>
+valid = true
+}
+
+assert(valid)
+  }
+
+  test("Median should be correct when size of MedianHeap is ord or even") {
+val random = new Random()
+val medianHeap1 = new MedianHeap()
+val array1 = new Array[Int](100)
+(0 until 100).foreach {
+  case i =>
+val randomNumber = random.nextInt(1000)
+medianHeap1.insert(randomNumber)
+array1(i) += randomNumber
+}
+util.Arrays.sort(array1)
+assert(medianHeap1.findMedian() === ((array1(49) + array1(50)) / 2.0))
+
+val medianHeap2 = new MedianHeap()
+val array2 = new Array[Int](101)
+(0 until 101).foreach {
+  case i =>
+val randomNumber = random.nextInt(1000)
+medianHeap2.insert(randomNumber)
+array2(i) += randomNumber
+}
+util.Arrays.sort(array2)
+assert(medianHeap2.findMedian() === array2(50))
+  }
+
+  test("Size of Median should be correct though there are duplicated 
numbers inside.") {
--- End diff --

Yes, but no validation of the median is being done.
Either enhance this test to validate the median, or add a new test to do so 
(if this test is only meant to cover duplicates).
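
For example, a rough sketch of such a test (reusing the `MedianHeap` from the 
diff above; the exact values are only illustrative):

```scala
test("Median should be correct when there are duplicated numbers inside.") {
  val medianHeap = new MedianHeap()
  // 10 copies each of 1, 2 and 3: with 30 elements the median is the
  // average of the 15th and 16th smallest values, both of which are 2.
  (1 to 3).foreach(v => (1 to 10).foreach(_ => medianHeap.insert(v)))
  assert(medianHeap.size() === 30)
  assert(medianHeap.findMedian() === 2.0)
}
```
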


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17075: [SPARK-19727][SQL] Fix for round function that mo...

2017-03-05 Thread wojtek-szymanski
Github user wojtek-szymanski commented on a diff in the pull request:

https://github.com/apache/spark/pull/17075#discussion_r104351913
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala ---
@@ -422,3 +434,4 @@ class MathFunctionsSuite extends QueryTest with 
SharedSQLContext {
 checkAnswer(df.selectExpr("positive(b)"), Row(-1))
   }
 }
+case class NumericRow(value : BigDecimal)
--- End diff --

replaced with Tuple1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17075: [SPARK-19727][SQL] Fix for round function that mo...

2017-03-05 Thread wojtek-szymanski
Github user wojtek-szymanski commented on a diff in the pull request:

https://github.com/apache/spark/pull/17075#discussion_r104351867
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala ---
@@ -223,6 +224,19 @@ final class Decimal extends Ordered[Decimal] with 
Serializable {
   }
 
   /**
+   * Create new `Decimal` with given precision and scale.
+   *
+   * @return `Some(decimal)` if successful or `None` if overflow would 
occur
+   */
+  private[sql] def toPrecision(
--- End diff --

fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17087: [SPARK-19372][SQL] Fix throwing a Java exception at df.f...

2017-03-05 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/17087
  
@davies, could you please review this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17172
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73971/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17172
  
**[Test build #73971 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73971/testReport)**
 for PR 17172 at commit 
[`d8b5f8d`](https://github.com/apache/spark/commit/d8b5f8d839d5c3388244cf2a6dcf4494d927145f).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17172
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17172
  
**[Test build #73971 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73971/testReport)**
 for PR 17172 at commit 
[`d8b5f8d`](https://github.com/apache/spark/commit/d8b5f8d839d5c3388244cf2a6dcf4494d927145f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17172: [SPARK-19008][SQL] Improve performance of Dataset...

2017-03-05 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/17172

[SPARK-19008][SQL] Improve performance of Dataset.map by eliminating 
boxing/unboxing

## What changes were proposed in this pull request?

This PR improves the performance of Dataset.map() for primitive types by 
removing boxing/unboxing operations.

Currently, Catalyst generates a call to the generic `apply()` method of an 
anonymous function written in Scala. The argument and return types of that 
method are `java.lang.Object`, so each call on a primitive value involves a 
pair of unboxing and boxing for the argument and a pair of boxing and unboxing 
for the return value.

This PR instead directly calls a specialized version of the `apply()` method, 
without boxing and unboxing. For example, if the argument and return types are 
both `int`, this PR generates a call to `apply$mcII$sp`. Any combination of 
`Int`, `Long`, `Float`, and `Double` is supported; a small illustration follows.

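As a self-contained illustration of the difference (not Spark code; the names 
below are made up for the example), Scala's `Function1` is specialized for 
these primitive types, so a closure such as `i => i * 7` also carries an 
`apply$mcII$sp(int): int` method next to the generic `apply(Object): Object`:

```scala
object BoxingDemo {
  def main(args: Array[String]): Unit = {
    val func: Int => Int = i => i * 7

    // Generic path: viewed as Object => Object, every call boxes the argument
    // and the result. This is effectively what the old generated code did.
    val generic = func.asInstanceOf[AnyRef => AnyRef]
    val boxedResult = generic(Int.box(3))      // a boxed java.lang.Integer 21

    // Specialized path: the compiler emits a call to apply$mcII$sp here, so no
    // boxing occurs. This is the kind of call the PR makes Catalyst generate.
    val primitiveResult: Int = func(3)         // plain int 21

    println(s"$boxedResult $primitiveResult")
  }
}
```
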

The following is a benchmark result using [this 
program](https://github.com/apache/spark/pull/16391/files), showing roughly a 
4.7x speedup for the Dataset case. The Dataset part of the benchmark program is 
shown after the results.

Without this PR
```
OpenJDK 64-Bit Server VM 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14 on Linux 4.4.0-47-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
back-to-back map:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------
RDD                                     1923 / 1952         52.0          19.2       1.0X
DataFrame                                526 /  548        190.2           5.3       3.7X
Dataset                                 3094 / 3154         32.3          30.9       0.6X
```

With this PR
```
OpenJDK 64-Bit Server VM 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14 on Linux 4.4.0-47-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
back-to-back map:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------
RDD                                     1883 / 1892         53.1          18.8       1.0X
DataFrame                                502 /  642        199.1           5.0       3.7X
Dataset                                  657 /  784        152.2           6.6       2.9X
```

```scala
  def backToBackMap(spark: SparkSession, numRows: Long, numChains: Int): 
Benchmark = {
import spark.implicits._
val rdd = spark.sparkContext.range(0, numRows)
val ds = spark.range(0, numRows)
val func = (l: Long) => l + 1
val benchmark = new Benchmark("back-to-back map", numRows)
...
benchmark.addCase("Dataset") { iter =>
  var res = ds.as[Long]
  var i = 0
  while (i < numChains) {
res = res.map(func)
i += 1
  }
  res.queryExecution.toRdd.foreach(_ => Unit)
}
benchmark
  }
```


A motivating example
```scala
Seq(1, 2, 3).toDS.map(i => i * 7).show
```

Generated code without this PR
```java
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private scala.collection.Iterator inputadapter_input;
/* 009 */   private UnsafeRow deserializetoobject_result;
/* 010 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
deserializetoobject_holder;
/* 011 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
deserializetoobject_rowWriter;
/* 012 */   private int mapelements_argValue;
/* 013 */   private UnsafeRow mapelements_result;
/* 014 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
mapelements_holder;
/* 015 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
mapelements_rowWriter;
/* 016 */   private UnsafeRow serializefromobject_result;
/* 017 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
serializefromobject_holder;
/* 018 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
serializefromobject_rowWriter;
/* 019 */
/* 020 */   public GeneratedIterator(Object[] references) {
/* 021 */ this.references = references;
/* 022 */   }
/* 023 */
/* 024 */   public void init(int index, scala.collection.Iterator[] inputs) 
{
/* 025 */ partitionIndex = index;
/* 026 */ this.inputs = inputs;
/* 027 */ inputadapter_input = inputs[0];
/* 028 */ 

[GitHub] spark pull request #16726: [SPARK-19390][SQL] Replace the unnecessary usages...

2017-03-05 Thread gatorsmile
Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/16726


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16933: [SPARK-19601] [SQL] Fix CollapseRepartition rule to pres...

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16933
  
cc @cloud-fan @hvanhovell 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17171: [SPARK-19830] [SQL] Add parseTableSchema API to ParserIn...

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17171
  
cc @hvanhovell @cloud-fan 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17055: [SPARK-19723][SQL]create datasource table with an...

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17055#discussion_r104351251
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -1588,8 +1588,102 @@ class HiveDDLSuite
 }
   }
 
+  test("create datasource table with a non-existing location") {
--- End diff --

Let me try to combine the HiveDDLSuite.scala and DDLSuite.scala. Otherwise, 
the test cases need to be duplicated in every PR. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17055: [SPARK-19723][SQL]create datasource table with an...

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17055#discussion_r104351051
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -1950,8 +1950,53 @@ class DDLSuite extends QueryTest with 
SharedSQLContext with BeforeAndAfterEach {
 }
   }
 
+  test("create datasource table with a non-existing location") {
+withTable("t", "t1") {
+  withTempDir {
+dir =>
+  dir.delete()
+  spark.sql(
+s"""
+   |CREATE TABLE t(a int, b int)
+   |USING parquet
+   |LOCATION '$dir'
+ """.stripMargin)
--- End diff --

The test case is already pretty long. Please reduce the SQL statement to a 
single line.
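
For example, something along these lines (sketch only):

```scala
spark.sql(s"CREATE TABLE t(a int, b int) USING parquet LOCATION '$dir'")
```
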


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17055: [SPARK-19723][SQL]create datasource table with an...

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17055#discussion_r104351006
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -1950,8 +1950,53 @@ class DDLSuite extends QueryTest with 
SharedSQLContext with BeforeAndAfterEach {
 }
   }
 
+  test("create datasource table with a non-existing location") {
+withTable("t", "t1") {
+  withTempDir {
+dir =>
--- End diff --

Nit: style issue. The above two lines can be combined.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17055: [SPARK-19723][SQL]create datasource table with an...

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17055#discussion_r104350967
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -1950,8 +1950,53 @@ class DDLSuite extends QueryTest with 
SharedSQLContext with BeforeAndAfterEach {
 }
   }
 
+  test("create datasource table with a non-existing location") {
+withTable("t", "t1") {
+  withTempDir {
+dir =>
+  dir.delete()
--- End diff --

`withTempPath`?
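
i.e., something along these lines, assuming the existing `SQLTestUtils.withTempPath` 
helper, which hands the body a path that does not yet exist and cleans it up 
afterwards (sketch only, body elided):

```scala
withTempPath { dir =>
  spark.sql(s"CREATE TABLE t(a int, b int) USING parquet LOCATION '$dir'")
  // ... assertions on the created table ...
}
```
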


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17171: [SPARK-19830] [SQL] Add parseTableSchema API to ParserIn...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17171
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73965/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17171: [SPARK-19830] [SQL] Add parseTableSchema API to ParserIn...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17171
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17171: [SPARK-19830] [SQL] Add parseTableSchema API to ParserIn...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17171
  
**[Test build #73965 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73965/testReport)**
 for PR 17171 at commit 
[`50f74d2`](https://github.com/apache/spark/commit/50f74d26ec1bc37c5f5bea054da60a1910778e46).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17068: [SPARK-19709][SQL] Read empty file with CSV data ...

2017-03-05 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17068#discussion_r104350586
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala
 ---
@@ -190,28 +194,28 @@ object WholeFileCSVDataSource extends CSVDataSource {
   sparkSession: SparkSession,
   inputPaths: Seq[FileStatus],
   parsedOptions: CSVOptions): Option[StructType] = {
-val csv: RDD[PortableDataStream] = createBaseRdd(sparkSession, 
inputPaths, parsedOptions)
-val maybeFirstRow: Option[Array[String]] = csv.flatMap { lines =>
+val csv = createBaseRdd(sparkSession, inputPaths, parsedOptions)
+csv.flatMap { lines =>
   UnivocityParser.tokenizeStream(
 
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-false,
+shouldDropHeader = false,
 new CsvParser(parsedOptions.asParserSettings))
-}.take(1).headOption
-
-if (maybeFirstRow.isDefined) {
-  val firstRow = maybeFirstRow.get
-  val caseSensitive = 
sparkSession.sessionState.conf.caseSensitiveAnalysis
-  val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions)
-  val tokenRDD = csv.flatMap { lines =>
-UnivocityParser.tokenizeStream(
-  
CodecStreams.createInputStreamWithCloseResource(lines.getConfiguration, 
lines.getPath()),
-  parsedOptions.headerFlag,
-  new CsvParser(parsedOptions.asParserSettings))
-  }
-  Some(CSVInferSchema.infer(tokenRDD, header, parsedOptions))
-} else {
-  // If the first row could not be read, just return the empty schema.
-  Some(StructType(Nil))
+}.take(1).headOption match {
--- End diff --

IMHO, `Option.isDefined` with `Option.get`, `Option.map` with 
`Option.getOrElse`, and `Option` with `match { case Some(...) => ... case None => ... }` 
are all fine. But how about minimising the change by keeping the 
`Option.isDefined` with `Option.get` form used above? Then it would not require 
the changes here.
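
For reference, a tiny self-contained sketch of the three equivalent shapes 
being compared (the names and values are made up for the example):

```scala
object OptionStyles {
  def main(args: Array[String]): Unit = {
    val maybeFirstRow: Option[Array[String]] = Some(Array("a", "b"))

    // isDefined + get, closest to the original code:
    val s1 = if (maybeFirstRow.isDefined) maybeFirstRow.get.mkString(",") else ""

    // map + getOrElse:
    val s2 = maybeFirstRow.map(_.mkString(",")).getOrElse("")

    // pattern matching, as in the diff above:
    val s3 = maybeFirstRow match {
      case Some(firstRow) => firstRow.mkString(",")
      case None => ""
    }

    println(Seq(s1, s2, s3).mkString(" | "))
  }
}
```
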


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17068: [SPARK-19709][SQL] Read empty file with CSV data source

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17068
  
**[Test build #73970 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73970/testReport)**
 for PR 17068 at commit 
[`e7faa80`](https://github.com/apache/spark/commit/e7faa80f9628519cb12458dae2fd26c93ac6f87b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16782: [SPARK-19348][PYTHON] PySpark keyword_only decorator is ...

2017-03-05 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/16782
  
Sure, I can do a backport @jkbradley, will ping you when ready

On Mar 3, 2017 4:46 PM, "asfgit"  wrote:

> Closed #16782  via 44281ca
> 

> .
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> , or mute the
> thread
> 

> .
>



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17097: [SPARK-19765][SQL] UNCACHE TABLE should un-cache ...

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17097#discussion_r104347576
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoDataSourceCommand.scala
 ---
@@ -42,8 +42,9 @@ case class InsertIntoDataSourceCommand(
 val df = 
sparkSession.internalCreateDataFrame(data.queryExecution.toRdd, 
logicalRelation.schema)
 relation.insert(df, overwrite)
 
-// Invalidate the cache.
-sparkSession.sharedState.cacheManager.invalidateCache(logicalRelation)
+// Re-cache all cached plans(including this relation itself, if it's 
cached) that refer to this
+// data source relation.
+sparkSession.sharedState.cacheManager.recacheByPlan(sparkSession, 
logicalRelation)
--- End diff --

In the current design, users need to re-cache the queries by themselves.

After this change, insertion could become very slow. Each insert could trigger 
re-caching of many involved cached plans, each of which could be very complex 
and expensive. That is a trade-off: although we keep data 
correctness/consistency, we could sacrifice performance/user experience. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-05 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16729
  
@felixcheung Sorry for taking so long for this update. 

I think your first suggestion makes the most sense, i.e., we do not expose the 
internal `tweedie`.
When `statmod` is loaded, users can use `tweedie` directly (from statmod); 
otherwise, they can use `SparkR:::tweedie`, which has the same syntax. 

I have made this work. The following shows that it now works both when 
statmod is not loaded (using `SparkR:::tweedie`) and when statmod is loaded 
(using `tweedie`).

Let me know if there are any other issues. Thanks. 

```
training <- suppressWarnings(createDataFrame(iris))
model1 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
 family = SparkR:::tweedie(var.power = 1.2, link.power 
= 1.0))
summary(model1)$coefficients

 Estimate Std. Errort value Pr(>|t|)
(Intercept) 1.7009666 0.22970461   7.405017 9.638512e-12
Sepal_Length0.3436703 0.04518882   7.605206 3.200329e-12
Species_versicolor -0.9703190 0.07090188 -13.685377 0.00e+00
Species_virginica  -0.9852650 0.09129919 -10.791607 0.00e+00

library(statmod)
model2 <- spark.glm(training, Sepal_Width ~ Sepal_Length + Species,
 family = tweedie(var.power = 1.2, link.power = 1.0))
summary(model2)$coefficients
 Estimate Std. Errort value Pr(>|t|)
(Intercept) 1.7009666 0.22970461   7.405017 9.638512e-12
Sepal_Length0.3436703 0.04518882   7.605206 3.200329e-12
Species_versicolor -0.9703190 0.07090188 -13.685377 0.00e+00
Species_virginica  -0.9852650 0.09129919 -10.791607 0.00e+00
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17110: [SPARK-19635][ML] DataFrame-based API for chi square tes...

2017-03-05 Thread imatiach-msft
Github user imatiach-msft commented on the issue:

https://github.com/apache/spark/pull/17110
  
cool, I'll hold off on reviewing this for now then


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17170: [SPARK-19825][WIP][R][ML] spark.ml R API for FPGrowth

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17170
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73966/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17170: [SPARK-19825][WIP][R][ML] spark.ml R API for FPGrowth

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17170
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17170: [SPARK-19825][WIP][R][ML] spark.ml R API for FPGrowth

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17170
  
**[Test build #73966 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73966/testReport)**
 for PR 17170 at commit 
[`6554384`](https://github.com/apache/spark/commit/65543840b347680123f5a478c0d8454b2a08482f).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  class FPGrowthWrapperReader extends MLReader[FPGrowthWrapper] `
  * `class FPGrowthWrapperWriter(instance: FPGrowthWrapper) extends 
MLWriter `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16867
  
@mridulm 
Thanks a lot for your comments. I previously did a test with `TreeSet` and 
100k tasks, measuring the time spent on insertion. The results were 372ms, 
362ms, 458ms, 429ms and 363ms, which I think is acceptable, even though the 
time complexity is O(log n) per insertion.
It would be great if you could give more advice; I'm glad to try more ideas and 
do measurements :)
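
For reference, a rough sketch of the kind of micro-benchmark described above 
(the setup and names are assumptions for illustration, not the actual test code):

```scala
import scala.collection.mutable
import scala.util.Random

object InsertTiming {
  def main(args: Array[String]): Unit = {
    val n = 100000
    // Key by (duration, taskId) so duplicate durations are still kept in the set.
    val durations = Array.fill(n)(Random.nextDouble() * 1000)
    val set = mutable.TreeSet.empty[(Double, Int)]

    val start = System.nanoTime()
    var i = 0
    while (i < n) {
      set += ((durations(i), i))
      i += 1
    }
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(f"inserted $n%d elements in $elapsedMs%.1f ms")
  }
}
```
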


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-05 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r104346324
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,274 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeSet, 
Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, 
Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) if 
!j.ordered =>
+  reorder(j, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) if !j.ordered =>
+  reorder(j, j.outputSet)
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
+val result =
+  if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && 
conditions.nonEmpty) {
+JoinReorderDP(conf, items, conditions, 
output).search().getOrElse(plan)
+  } else {
+plan
+  }
+// Set all inside joins ordered.
+setOrdered(result)
+result
+  }
+
+  /**
+   * Extract inner joinable items and join conditions.
+   * This method works for bushy trees and left/right deep trees.
+   */
+  def extractInnerJoins(plan: LogicalPlan): (Seq[LogicalPlan], 
Set[Expression]) = plan match {
+case j @ Join(left, right, _: InnerLike, cond) =>
+  val (leftPlans, leftConditions) = extractInnerJoins(left)
+  val (rightPlans, rightConditions) = extractInnerJoins(right)
+  (leftPlans ++ rightPlans, 
cond.map(splitConjunctivePredicates).getOrElse(Nil).toSet ++
--- End diff --

Sorry can you explain a little more here? Why should I canonicalize?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #73969 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73969/testReport)**
 for PR 16826 at commit 
[`3ee271f`](https://github.com/apache/spark/commit/3ee271f14dde66ee1aefeb4f463c36b514e4f28a).
 * This patch **fails Scala style tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16826
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73969/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16826: [SPARK-19540][SQL] Add ability to clone SparkSession whe...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #73969 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73969/testReport)**
 for PR 16826 at commit 
[`3ee271f`](https://github.com/apache/spark/commit/3ee271f14dde66ee1aefeb4f463c36b514e4f28a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16867
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73955/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17138: [SPARK-17080] [SQL] join reorder

2017-03-05 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17138#discussion_r104346049
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala
 ---
@@ -0,0 +1,274 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.CatalystConf
+import org.apache.spark.sql.catalyst.expressions.{And, AttributeSet, 
Expression, PredicateHelper}
+import org.apache.spark.sql.catalyst.plans.{Inner, InnerLike}
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, 
Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+
+
+/**
+ * Cost-based join reorder.
+ * We may have several join reorder algorithms in the future. This class 
is the entry of these
+ * algorithms, and chooses which one to use.
+ */
+case class CostBasedJoinReorder(conf: CatalystConf) extends 
Rule[LogicalPlan] with PredicateHelper {
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.cboEnabled || !conf.joinReorderEnabled) {
+  plan
+} else {
+  plan transform {
+case p @ Project(projectList, j @ Join(_, _, _: InnerLike, _)) if 
!j.ordered =>
+  reorder(j, p.outputSet)
+case j @ Join(_, _, _: InnerLike, _) if !j.ordered =>
+  reorder(j, j.outputSet)
+  }
+}
+  }
+
+  def reorder(plan: LogicalPlan, output: AttributeSet): LogicalPlan = {
+val (items, conditions) = extractInnerJoins(plan)
+val result =
+  if (items.size > 2 && items.size <= conf.joinReorderDPThreshold && 
conditions.nonEmpty) {
+JoinReorderDP(conf, items, conditions, 
output).search().getOrElse(plan)
+  } else {
+plan
+  }
+// Set all inside joins ordered.
+setOrdered(result)
+result
+  }
+
+  /**
+   * Extract inner joinable items and join conditions.
+   * This method works for bushy trees and left/right deep trees.
+   */
+  def extractInnerJoins(plan: LogicalPlan): (Seq[LogicalPlan], 
Set[Expression]) = plan match {
+case j @ Join(left, right, _: InnerLike, cond) =>
+  val (leftPlans, leftConditions) = extractInnerJoins(left)
+  val (rightPlans, rightConditions) = extractInnerJoins(right)
+  (leftPlans ++ rightPlans, 
cond.map(splitConjunctivePredicates).getOrElse(Nil).toSet ++
+leftConditions ++ rightConditions)
+case Project(_, j @ Join(left, right, _: InnerLike, cond)) =>
--- End diff --

My goal is to deal only with consecutive joins, but due to column pruning, 
there can be Projects that break join chains. So I only want to support joins 
and the Projects introduced by column pruning here.
I hadn't considered other Projects like the ones you mentioned above.
Then how about we recurse only when the project list consists entirely of 
attributes (not compound expressions like `a+b`)?
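
A rough sketch of that guard, written as a variation of the case in the diff 
above (illustrative only; it assumes `Attribute` from `catalyst.expressions` is 
in scope):

```scala
case Project(projectList, j @ Join(_, _, _: InnerLike, _))
    if projectList.forall(_.isInstanceOf[Attribute]) =>
  // Only look through Projects that merely prune columns; anything with
  // compound expressions (e.g. a + b) stops the extraction here.
  extractInnerJoins(j)
```
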


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16867
  
**[Test build #73955 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73955/testReport)**
 for PR 16867 at commit 
[`1fac678`](https://github.com/apache/spark/commit/1fac678ec36fc728275ce8764581be77d5739f37).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class MedianHeap(implicit val ord: Ordering[Double]) `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16867
  
**[Test build #73968 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73968/testReport)**
 for PR 16867 at commit 
[`f30ef46`](https://github.com/apache/spark/commit/f30ef46f7e64416e7b958dffca77fdf6b6d1dfba).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16729
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16729
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73964/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16729
  
**[Test build #73964 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73964/testReport)**
 for PR 16729 at commit 
[`ef65adc`](https://github.com/apache/spark/commit/ef65adc69b710366266abf5df6f458af3a66a3bf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17125: [SPARK-19211][SQL] Explicitly prevent Insert into...

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17125#discussion_r104345660
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -604,7 +604,13 @@ class Analyzer(
 
     def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
       case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
-        i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u)))
+        val newTable = EliminateSubqueryAliases(lookupTableFromCatalog(u))
+        // Inserting into a view is not allowed, we should throw an AnalysisException.
+        if (newTable.isInstanceOf[View]) {
+          u.failAnalysis(s"${newTable.asInstanceOf[View].desc.identifier} is a view, inserting " +
+            s"into a view is not allowed")
+        }
+        i.copy(table = newTable)
--- End diff --

How about?

```Scala
lookupTableFromCatalog(u).canonicalized match {
  case v: View =>
    u.failAnalysis(s"Inserting into a view is not allowed. View: ${v.desc.identifier}.")
  case other => i.copy(table = other)
}
```
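
To make the effect of the check concrete, here is a rough end-to-end sketch (my own illustration, not code from this PR; the table and view names and the session setup are made up, and the exact error text depends on which wording gets merged):

```Scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder().master("local").getOrCreate()
spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("CREATE VIEW v AS SELECT id FROM t")
try {
  // With the view check in ResolveRelations, this fails at analysis time.
  spark.sql("INSERT INTO TABLE v SELECT 1")
} catch {
  case e: AnalysisException => println(e.getMessage)
}
spark.stop()
```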


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17125: [SPARK-19211][SQL] Explicitly prevent Insert into...

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17125#discussion_r104345227
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -604,7 +604,13 @@ class Analyzer(
 
     def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
       case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
-        i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u)))
+        val newTable = EliminateSubqueryAliases(lookupTableFromCatalog(u))
+        // Inserting into a view is not allowed, we should throw an AnalysisException.
+        if (newTable.isInstanceOf[View]) {
+          u.failAnalysis(s"${newTable.asInstanceOf[View].desc.identifier} is a view, inserting " +
+            s"into a view is not allowed")
--- End diff --

Nit: `s"into a view is not allowed"` -> `"into a view is not allowed"`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17152: [SPARK-18389][SQL] Disallow cyclic view reference

2017-03-05 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17152
  
Yeah. The temporary view does not have such an issue, because we did not 
change it. That was my typo. 

What I meant is `CREATE OR REPLACE VIEW`: `AlterViewAsCommand` does not 
cover that code path. : )
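
For readers following the discussion, a minimal sketch of the code path being described (illustrative only; the view names and session setup are not taken from the PR's tests):

```Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
// Build a reference chain v1 -> v2.
spark.sql("CREATE VIEW v1 AS SELECT 1 AS col")
spark.sql("CREATE VIEW v2 AS SELECT * FROM v1")
// Replacing v1 so that it reads from v2 would close the cycle v1 -> v2 -> v1.
// This goes through CREATE OR REPLACE VIEW rather than ALTER VIEW ... AS,
// so a check that lives only in AlterViewAsCommand would not catch it.
spark.sql("CREATE OR REPLACE VIEW v1 AS SELECT * FROM v2")
spark.stop()
```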


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16867: [SPARK-16929] Improve performance when check speculatabl...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16867
  
**[Test build #73967 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73967/testReport)**
 for PR 16867 at commit 
[`712e06e`](https://github.com/apache/spark/commit/712e06e4f851481700f0cebc194891fa57102458).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16696: [SPARK-19350] [SQL] Cardinality estimation of Limit and ...

2017-03-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/16696
  
@cloud-fan Does this look good to you now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17125: [SPARK-19211][SQL] Explicitly prevent Insert into View o...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17125
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17125: [SPARK-19211][SQL] Explicitly prevent Insert into View o...

2017-03-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17125
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73958/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17125: [SPARK-19211][SQL] Explicitly prevent Insert into View o...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17125
  
**[Test build #73958 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73958/testReport)**
 for PR 17125 at commit 
[`9af2d7e`](https://github.com/apache/spark/commit/9af2d7e4c249da15711b47936693687db5e0658b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16867: [SPARK-16929] Improve performance when check spec...

2017-03-05 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16867#discussion_r104344524
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -754,7 +743,6 @@ private[spark] class TaskSetManager(
     }
     removeRunningTask(tid)
     info.markFinished(state)
-    successfulTaskIdsSet -= tid
--- End diff --

Removing a duration from `successfulTaskDurations` is quite time-consuming (O(n)) now. 
We only use `successfulTaskDurations` to compute the median duration, so I would 
hesitate to do the removal.
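
For context on the O(n) point, here is a rough sketch (my own illustration, not code from this PR) of what deleting a single value from a `scala.collection.mutable.PriorityQueue` would involve. The queue has no delete-by-value operation, so the backing heap has to be drained and rebuilt:

```Scala
import scala.collection.mutable

// Hypothetical helper: remove one occurrence of `x` by rebuilding the heap, O(n) per call.
def removeOne(heap: mutable.PriorityQueue[Double], x: Double)
    (implicit ord: Ordering[Double]): mutable.PriorityQueue[Double] = {
  val remaining = heap.toBuffer                 // copy out all elements
  val idx = remaining.indexOf(x)
  if (idx >= 0) remaining.remove(idx)           // drop one occurrence, if present
  val rebuilt = mutable.PriorityQueue.empty[Double](ord)
  remaining.foreach(rebuilt.enqueue(_))         // rebuild the heap from the rest
  rebuilt
}
```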


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16867: [SPARK-16929] Improve performance when check spec...

2017-03-05 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16867#discussion_r104344529
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -754,7 +743,6 @@ private[spark] class TaskSetManager(
     }
     removeRunningTask(tid)
     info.markFinished(state)
-    successfulTaskIdsSet -= tid
--- End diff --

Removing a duration from `successfulTaskDurations` is quite time-consuming (O(n)) now. 
We only use `successfulTaskDurations` to compute the median duration, so I would 
hesitate to do the removal.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16867: [SPARK-16929] Improve performance when check spec...

2017-03-05 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16867#discussion_r104344274
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/MedianHeap.scala ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.collection.mutable
+
+/**
+ * MedianHeap stores numbers and returns the median in O(1) time.
+ * The basic idea is to maintain two heaps: a maxHeap and a minHeap. The maxHeap stores
+ * the smaller half of all numbers while the minHeap stores the larger half. The sizes
+ * of the two heaps need to be balanced each time a new number is inserted, so that they
+ * never differ by more than 1. Therefore, each time findMedian() is called, we check
+ * whether the two heaps have the same size. If they do, we return the average of the
+ * two top values; otherwise we return the top of the heap that has one more element.
+ */
+
+private[spark]
+class MedianHeap(implicit val ord: Ordering[Double]) {
+
+  // Stores all the numbers less than the current median in a maxHeap,
+  // i.e median is the maximum, at the root
+  val maxHeap = mutable.PriorityQueue.empty[Double](implicitly[Ordering[Double]])
+
+  // Stores all the numbers greater than the current median in a minHeap,
+  // i.e median is the minimum, at the root
+  val minHeap = mutable.PriorityQueue.empty[Double](implicitly[Ordering[Double]].reverse)
+
+  // Returns if there is no element in MedianHeap.
+  def isEmpty(): Boolean = {
+    maxHeap.isEmpty && minHeap.isEmpty
+  }
+
+  // Size of MedianHeap.
+  def size(): Int = {
+    maxHeap.size + minHeap.size
+  }
+
+  // Insert a new number into MedianHeap.
+  def insert(x: Double): Unit = {
+    // If both heaps are empty, we arbitrarily insert it into a heap, let's say, the minHeap.
+    if (isEmpty) {
+      minHeap.enqueue(x)
+    } else {
+      // If the number is larger than current median, it should be inserted into minHeap,
+      // otherwise maxHeap.
+      if (x > findMedian) {
+        minHeap.enqueue(x)
+      } else {
+        maxHeap.enqueue(x)
+      }
+    }
+    rebalance()
+  }
+
+  // Re-balance the heaps.
+  private[this] def rebalance(): Unit = {
+    if (minHeap.size - maxHeap.size > 1) {
+      maxHeap.enqueue(minHeap.dequeue())
+    }
+    if (maxHeap.size - minHeap.size > 1) {
+      minHeap.enqueue(maxHeap.dequeue())
+    }
+  }
+
+  // Returns the median of the numbers.
+  def findMedian(): Double = {
+    if (isEmpty) {
+      throw new NoSuchElementException("MedianHeap is empty.")
+    }
+    if (minHeap.size == maxHeap.size) {
+      (minHeap.head + maxHeap.head) / 2.0
--- End diff --

I think the average is the definition of the median when there is an even number of 
elements. Maybe it's better to keep it as `(minHeap.head + maxHeap.head) / 2.0`?
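
As a quick illustration of that definition (a sketch only; it assumes the `MedianHeap` from this PR is on the classpath and is called from within the `org.apache.spark` namespace, since the class is `private[spark]`):

```Scala
import org.apache.spark.util.collection.MedianHeap

val heap = new MedianHeap()
Seq(1.0, 2.0, 3.0, 4.0).foreach(heap.insert)
// Even count: the two middle values are 2.0 and 3.0, so the median is their average.
println(heap.findMedian())  // 2.5
heap.insert(10.0)
// Odd count: the median is the single middle value.
println(heap.findMedian())  // 3.0
```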


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16867: [SPARK-16929] Improve performance when check spec...

2017-03-05 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16867#discussion_r104344072
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/MedianHeapSuite.scala ---
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import java.util
+import java.util.NoSuchElementException
+
+import scala.collection.mutable.ArrayBuffer
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+
+class MedianHeapSuite extends SparkFunSuite {
+
+  test("If no numbers in MedianHeap, NoSuchElementException is thrown.") {
+    val medianHeap = new MedianHeap()
+    var valid = false
+    try {
+      medianHeap.findMedian()
+    } catch {
+      case e: NoSuchElementException =>
+        valid = true
+    }
+
+    assert(valid)
+  }
+
+  test("Median should be correct when size of MedianHeap is odd or even") {
+    val random = new Random()
+    val medianHeap1 = new MedianHeap()
+    val array1 = new Array[Int](100)
+    (0 until 100).foreach {
+      case i =>
+        val randomNumber = random.nextInt(1000)
+        medianHeap1.insert(randomNumber)
+        array1(i) += randomNumber
+    }
+    util.Arrays.sort(array1)
+    assert(medianHeap1.findMedian() === ((array1(49) + array1(50)) / 2.0))
+
+    val medianHeap2 = new MedianHeap()
+    val array2 = new Array[Int](101)
+    (0 until 101).foreach {
+      case i =>
+        val randomNumber = random.nextInt(1000)
+        medianHeap2.insert(randomNumber)
+        array2(i) += randomNumber
+    }
+    util.Arrays.sort(array2)
+    assert(medianHeap2.findMedian() === array2(50))
+  }
+
+  test("Size of Median should be correct though there are duplicated numbers inside.") {
--- End diff --

`random.nextInt(100)` returns a value between 0 (inclusive) and the specified 
value (exclusive). If I call it 1000 times, there must be duplicates.
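
A small check along those lines (illustrative only, not taken from the test suite): by the pigeonhole principle, 1000 draws from `nextInt(100)` must contain duplicates, and every insertion should still be counted.

```Scala
import scala.util.Random
import org.apache.spark.util.collection.MedianHeap

val random = new Random()
val heap = new MedianHeap()
(0 until 1000).foreach(_ => heap.insert(random.nextInt(100).toDouble))
// Duplicates are kept, so the size matches the number of insertions.
assert(heap.size() == 1000)
```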


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17170: [SPARK-19825][WIP][R][ML] spark.ml R API for FPGrowth

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17170
  
**[Test build #73966 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73966/testReport)**
 for PR 17170 at commit 
[`6554384`](https://github.com/apache/spark/commit/65543840b347680123f5a478c0d8454b2a08482f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16997: Updated the Spark SQL Programming guide with Custom obje...

2017-03-05 Thread HarshSharma8
Github user HarshSharma8 commented on the issue:

https://github.com/apache/spark/pull/16997
  
Sure, and thanks for your kind attention to this pull request.


Thank You


Best Regards |
*Harsh Sharma*
Sr. Software Consultant
Knoldus Software LLP
FB | Twitter | LinkedIn

harshs...@gmail.com
Skype*: khandal60*
*+91-8447307237*

On Sun, Mar 5, 2017 at 10:13 PM, Sean Owen  wrote:

> This still has formatting and text problems. I'm sorry I don't think I can
> go around again for this when it's not an important change, and I'd like 
to
> close this.
>


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17171: [SPARK-19830] [SQL] Add parseTableSchema API to ParserIn...

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17171
  
**[Test build #73965 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73965/testReport)**
 for PR 17171 at commit 
[`50f74d2`](https://github.com/apache/spark/commit/50f74d26ec1bc37c5f5bea054da60a1910778e46).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17171: [SPARK-19830] [SQL] Add parseTableSchema API to P...

2017-03-05 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/17171

[SPARK-19830] [SQL] Add parseTableSchema API to ParserInterface

### What changes were proposed in this pull request?

Specifying the table schema in DDL formats is needed for different 
scenarios. For example, 
- [specifying the schema in SQL function `from_json` using DDL 
formats](https://issues.apache.org/jira/browse/SPARK-19637), which is suggested 
by @marmbrus , 
- [specifying the customized JDBC data 
types](https://github.com/apache/spark/pull/16209). 

Both of these PRs currently require users to specify the table schema in JSON 
format, which is not user-friendly. 

This PR provides a `parseTableSchema` API in `ParserInterface`. 

### How was this patch tested?
Added a test suite `TableSchemaParserSuite`
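
As a rough sketch of the intended usage (hypothetical snippet, not taken from this PR; it assumes the API is exposed as `def parseTableSchema(sqlText: String): StructType` on the parser):

```Scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.StructType

// Turn a DDL-formatted column list into a StructType without hand-writing JSON.
val schema: StructType = CatalystSqlParser.parseTableSchema("a INT, b STRING, c MAP<INT, STRING>")
// schema.simpleString == "struct<a:int,b:string,c:map<int,string>>"
```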

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark parseDDLStmt

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17171.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17171


commit 50f74d26ec1bc37c5f5bea054da60a1910778e46
Author: Xiao Li 
Date:   2017-03-06T04:18:20Z

add parseTableSchema API




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-03-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16729
  
**[Test build #73964 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73964/testReport)**
 for PR 16729 at commit 
[`ef65adc`](https://github.com/apache/spark/commit/ef65adc69b710366266abf5df6f458af3a66a3bf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


