[GitHub] spark issue #22960: [SPARK-25955][TEST] Porting JSON tests for CSV functions

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22960
  
Ur, maybe I'm not making the point clearly. The refactoring scope of this PR 
is limited to the new tests here.
```
test("from_csv uses DDL strings for defining a schema - java")
test("roundtrip to_csv -> from_csv")
test("roundtrip from_csv -> to_csv")
test("infers schemas of a CSV string and pass to to from_csv")
test("Support to_csv in SQL")
test("Support from_csv in SQL")
```


---




[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20944
  
Please describe the manual tests and how this relates to an actual use case.


---




[GitHub] spark issue #22960: [SPARK-25955][TEST] Porting JSON tests for CSV functions

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22960
  
Yes. It would be great if we do that in this PR.

When I did a similar thing for ORC (`port tests from Parquet to ORC`, 
`port from old ORC to new ORC`), I received the same comments.


---




[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r231404180
  
--- Diff: R/pkg/R/functions.R ---
@@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType")
 column(jc)
   })
 
+#' @details
+#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format.
+#'
+#' @rdname column_collection_functions
+#' @aliases schema_of_json schema_of_json,characterOrColumn-method
+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'
+#' df <- sql("SELECT * FROM range(1)")
+#' head(select(df, schema_of_json(json)))}
+#' @note schema_of_json since 3.0.0
+setMethod("schema_of_json", signature(x = "characterOrColumn"),
+  function(x, ...) {
+if (class(x) == "character") {
+  col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
+} else {
+  col <- x@jc
--- End diff --

Yup .. only literals work; columns don't.
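
The Scala side behaves the same way; a minimal sketch, assuming a DataFrame 
`df` (the column name `jsonCol` is hypothetical):

```scala
import org.apache.spark.sql.functions.{col, lit, schema_of_json}

// Works: the argument is foldable, so the schema can be computed at analysis time.
df.select(schema_of_json(lit("""{"name":"Bob"}""")))

// Fails analysis: schema inference cannot run per row, so a plain column
// reference is rejected.
// df.select(schema_of_json(col("jsonCol")))
```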


---




[GitHub] spark pull request #22921: [SPARK-25908][CORE][SQL] Remove old deprecated it...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22921#discussion_r231403827
  
--- Diff: R/pkg/R/functions.R ---
@@ -319,6 +319,27 @@ setMethod("acos",
 column(jc)
   })
 
+#' @details
+#' \code{approx_count_distinct}: Returns the approximate number of distinct items in a group.
+#'
+#' @rdname column_aggregate_functions
+#' @aliases approx_count_distinct approx_count_distinct,Column-method
+#' @examples
+#'
+#' \dontrun{
+#' head(select(df, approx_count_distinct(df$gear)))
+#' head(select(df, approx_count_distinct(df$gear, 0.02)))
+#' head(select(df, countDistinct(df$gear, df$cyl)))
+#' head(select(df, n_distinct(df$gear)))
+#' head(distinct(select(df, "gear")))}
--- End diff --

we only need one set - they are both `@rdname column_aggregate_functions`, 
so all the other examples will be duplicated


---




[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r231403096
  
--- Diff: R/pkg/R/functions.R ---
@@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", schema = "characterOrstructType")
 column(jc)
   })
 
+#' @details
+#' \code{schema_of_json}: Parses a JSON string and infers its schema in DDL format.
+#'
+#' @rdname column_collection_functions
+#' @aliases schema_of_json schema_of_json,characterOrColumn-method
+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'
+#' df <- sql("SELECT * FROM range(1)")
+#' head(select(df, schema_of_json(json)))}
+#' @note schema_of_json since 3.0.0
+setMethod("schema_of_json", signature(x = "characterOrColumn"),
+  function(x, ...) {
+if (class(x) == "character") {
+  col <- callJStatic("org.apache.spark.sql.functions", "lit", x)
+} else {
+  col <- x@jc
--- End diff --

you are saying this `select(df, schema_of_csv(df$schemaCol))` is not 
allowed?



---




[GitHub] spark issue #22960: [SPARK-25955][TEST] Porting JSON tests for CSV functions

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22960
  
**[Test build #98542 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98542/testReport)**
 for PR 22960 at commit 
[`1d3a31b`](https://github.com/apache/spark/commit/1d3a31b478622a8e76dfeef0f71973aa71730859).


---




[GitHub] spark issue #20944: [SPARK-23831][SQL] Add org.apache.derby to IsolatedClien...

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20944
  
Sorry, why was this change required? I don't see that 
https://github.com/apache/spark/pull/20944#issuecomment-379525776 is addressed. 
Can you elaborate, please? Why do we make `org.apache.derby` shared?

Ideally, minor or maintenance versions of `derby` can be bumped up, and 
they shouldn't be shared unless there's a strong reason to keep them shared, for 
instance, making class resolution fail. How did you reproduce this, and why 
is no unit test added?

I found an actual issue while working on Apache Livy Spark 2.4 support. I 
am still investigating how it relates to the test failures, but at the very 
least I see that this specific commit matters, since the Apache Livy unit tests 
pass without it.

Adding @vanzin
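
For context, `IsolatedClientLoader` decides per class name whether to load 
from the parent classloader; the check the PR extends works conceptually like 
this (a simplified sketch, not the actual implementation):

```scala
// Shared classes are loaded once from the parent classloader rather than per
// isolated Hive-client classloader; matching is by name prefix.
val sharedPrefixes = Seq("org.apache.spark", "scala", "java.lang", "org.apache.derby")

def isSharedClass(name: String): Boolean =
  sharedPrefixes.exists(prefix => name.startsWith(prefix))
```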


---




[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231402726
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  stopifnot(require("arrow", quietly = TRUE))
+  stopifnot(require("withr", quietly = TRUE))
+  numPartitions <- if (!is.null(numPartitions)) {
+numToInt(numPartitions)
+  } else {
+1
+  }
+  fileName <- tempfile()
--- End diff --

might need to give it a dir prefix to use - the tempfile default is not 
CRAN compliant, and there could possibly be some ACL issue


---




[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231402235
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  stopifnot(require("arrow", quietly = TRUE))
--- End diff --

btw, is it worthwhile to check the arrow package version?


---




[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231402297
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -172,15 +196,17 @@ getDefaultSqlSource <- function() {
 createDataFrame <- function(data, schema = NULL, samplingRatio = 1.0,
 numPartitions = NULL) {
   sparkSession <- getSparkSession()
-
+  conf <- callJMethod(sparkSession, "conf")
+  arrowEnabled <- tolower(callJMethod(conf, "get", "spark.sql.execution.arrow.enabled")) == "true"
--- End diff --

I think you can use `sparkR.conf`


---




[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231402063
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  stopifnot(require("arrow", quietly = TRUE))
+  stopifnot(require("withr", quietly = TRUE))
--- End diff --

is it possible to not depend on `withr`?


---




[GitHub] spark pull request #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization fr...

2018-11-06 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22954#discussion_r231401994
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -147,6 +147,30 @@ getDefaultSqlSource <- function() {
   l[["spark.sql.sources.default"]]
 }
 
+writeToTempFileInArrow <- function(rdf, numPartitions) {
+  stopifnot(require("arrow", quietly = TRUE))
--- End diff --

perhaps best to add a clearer error message?


---




[GitHub] spark issue #22960: [SPARK-25955][TEST] Porting JSON tests for CSV functions

2018-11-06 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/22960
  
> Sorry, but Porting seems to be not the best way to do this.

I saw a bunch of common code in `Csv`/`JsonExpressionsSuite`, 
`Csv`/`JsonFunctionsSuite` and `Csv`/`JsonSuite`. I just didn't want to 
overcomplicate the tests, especially in cases where there are only small 
differences. Passing functions (with inputs and expected results) to template 
functions would not make them easier to read.

> Could you refactor this by introducing new test helper functions?

In any case, I will try that.
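
To illustrate the trade-off, a template helper of the kind being discussed 
would look roughly like this (a hypothetical helper, not code from the PR):

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical shared helper: the CSV and JSON suites would each pass in their
// own to_/from_ functions, e.g.
//   checkRoundtrip(df, to_csv(_), from_csv(_, schema, options))
// at some cost to readability of the individual tests.
def checkRoundtrip(df: DataFrame, to: Column => Column, from: Column => Column): Unit = {
  val readback = df
    .select(to(col("struct")).as("serialized"))
    .select(from(col("serialized")).as("struct"))
  assert(readback.collect().sameElements(df.collect()))
}
```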



---




[GitHub] spark pull request #22960: [SPARK-25955][TEST] Porting JSON tests for CSV fu...

2018-11-06 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22960#discussion_r231399775
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala ---
@@ -86,4 +86,82 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext {
 
     checkAnswer(df.select(to_csv($"a", options)), Row("26/08/2015 18:00") :: Nil)
   }
+
+  test("from_csv uses DDL strings for defining a schema - java") {
+    val df = Seq("""1,"haa"""").toDS()
+    checkAnswer(
+      df.select(
+        from_csv($"value", lit("a INT, b STRING"), new java.util.HashMap[String, String]())),
+      Row(Row(1, "haa")) :: Nil)
+  }
+
+  test("roundtrip to_csv -> from_csv") {
+    val df = Seq(Tuple1(Tuple1(1)), Tuple1(null)).toDF("struct")
+    val schema = df.schema(0).dataType.asInstanceOf[StructType]
+    val options = Map.empty[String, String]
+    val readback = df.select(to_csv($"struct").as("csv"))
+      .select(from_csv($"csv", schema, options).as("struct"))
+
+    checkAnswer(df, readback)
+  }
+
+  test("roundtrip from_csv -> to_csv") {
+    val df = Seq(Some("1"), None).toDF("csv")
+    val schema = new StructType().add("a", IntegerType)
+    val options = Map.empty[String, String]
+    val readback = df.select(from_csv($"csv", schema, options).as("struct"))
+      .select(to_csv($"struct").as("csv"))
+
+    checkAnswer(df, readback)
+  }
+
+  test("infers schemas of a CSV string and pass to to from_csv") {
+    val in = Seq("""0.123456789,987654321,"San Francisco"""").toDS()
+    val options = Map.empty[String, String].asJava
+    val out = in.select(from_csv('value, schema_of_csv("0.1,1,a"), options) as "parsed")
+    val expected = StructType(Seq(StructField(
+      "parsed",
+      StructType(Seq(
+        StructField("_c0", DoubleType, true),
+        StructField("_c1", IntegerType, true),
+        StructField("_c2", StringType, true))))))
+
+    assert(out.schema == expected)
+  }
+
+  test("Support to_csv in SQL") {
--- End diff --

This is only to double-check that the functions are available (and work) as 
SQL expressions. We can probably make the test smaller.
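
Roughly what the SQL tests exercise - a sanity check that the functions are 
registered and callable from SQL (a sketch, not the exact test bodies):

```scala
// Assumes a SparkSession `spark`; the expected outputs live in the real tests.
spark.sql("SELECT to_csv(named_struct('a', 1, 'b', 'x'))").show()
spark.sql("SELECT from_csv('1,x', 'a INT, b STRING')").show()
```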


---




[GitHub] spark issue #22951: [SPARK-25945][SQL] Support locale while parsing date/tim...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22951
  
**[Test build #98541 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98541/testReport)**
 for PR 22951 at commit 
[`6ab8501`](https://github.com/apache/spark/commit/6ab850164182565c2cd8cffe99f5c4bb09ead660).


---




[GitHub] spark issue #22958: [SPARK-25952][SQL] Passing actual schema to JacksonParse...

2018-11-06 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/22958
  
@cloud-fan @HyukjinKwon May I ask you to have a look at this PR?


---




[GitHub] spark issue #22938: [SPARK-25935][SQL] Prevent null rows from JSON parser

2018-11-06 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/22938
  
@HyukjinKwon Are you ok with the changes?


---




[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15899
  
Since the issue is closed, this PR will be closed at the next infra clean-up.


---




[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15899
  
+1 for the decision and closing it.


---




[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15899
  
I see. Thank you for the clear decision, @rxin ! I'll close the issue as 
`Won't Fix`.

And, could you close this PR, @reggert ?


---




[GitHub] spark issue #22818: [SPARK-25904][CORE] Allocate arrays smaller than Int.Max...

2018-11-06 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22818
  
LGTM


---




[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15899
  
Thanks for the example. I didn't even know that was possible in earlier 
versions. I just looked it up: it looks like Scala 2.11 rewrites for 
comprehensions into map, filter, and flatMap.

That said, I don't think it's a bad deal that this no longer works, given 
it was never intended to work and there's been a deprecation warning.

I still maintain that it is risky to support this, because Scala users 
learn for comprehensions not just for a simple "for filter yield", but as a 
way to chain multiple generators together, which is not really well supported 
by Spark (and even where it is, it's a really easy way for users to shoot 
themselves in the foot, because it would be a cartesian product).

Rather than faking it as a local collection, users should know an RDD is not one.
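
A minimal sketch of the desugaring being described, with small hypothetical 
RDDs (standalone, runnable in spark-shell):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext
val rddA = sc.parallelize(Seq(1, 2, 3))
val rddB = sc.parallelize(Seq("x", "y"))

// `for (a <- rddA; b <- rddB) yield (a, b)` desugars to
// rddA.flatMap(a => rddB.map(b => (a, b))): an RDD captured inside a closure,
// which Spark cannot execute. The distributed equivalent is a cartesian product:
val pairs = rddA.cartesian(rddB)

// `for (n <- rddA if n > 2) yield n` desugars under 2.12 to
// rddA.withFilter(n => n > 2).map(n => n), hence the missing-method error.
val filtered = rddA.filter(_ > 2).map(n => n)
```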






---




[GitHub] spark issue #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15899
  
Hi, @rxin , @srowen , @dbtsai , @felixcheung , @gatorsmile , @cloud-fan .

I know this was not a recommended style, but there really are users with 
this issue. And, from Spark 2.4.0, we are releasing a Scala 2.12 build as an 
experiment. This case shows a regression, because previously the code 
worked with a warning. I'm +1 on this idea for Spark's Scala 2.12 support. 
What do you think?

```scala
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = local-1541571276105).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala> (for (n <- sc.parallelize(Seq(1,2,3)) if n > 2) yield n).toDebugString
<console>:25: error: value withFilter is not a member of org.apache.spark.rdd.RDD[Int]
       (for (n <- sc.parallelize(Seq(1,2,3)) if n > 2) yield n).toDebugString
```
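
For anyone hitting this today, the effect of the proposed change can be 
approximated user-side with an enrichment (a sketch, not the PR itself):

```scala
import org.apache.spark.rdd.RDD

object RddSyntax {
  // Delegating withFilter to filter makes the `for ... if ... yield`
  // desugaring resolve against RDDs under Scala 2.12.
  implicit class RddWithFilter[T](val rdd: RDD[T]) extends AnyVal {
    def withFilter(f: T => Boolean): RDD[T] = rdd.filter(f)
  }
}
```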


---




[GitHub] spark pull request #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15899#discussion_r231390266
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -387,6 +387,14 @@ abstract class RDD[T: ClassTag](
   preservesPartitioning = true)
   }
 
+  /**
+    * Return a new RDD containing only the elements that satisfy a predicate.
--- End diff --

Why bother unless we have consensus to introduce this API?



---




[GitHub] spark pull request #19796: [SPARK-22581][SQL] Catalog api does not allow to ...

2018-11-06 Thread timvw
Github user timvw closed the pull request at:

https://github.com/apache/spark/pull/19796


---




[GitHub] spark pull request #15899: [SPARK-18466] added withFilter method to RDD

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/15899#discussion_r231389555
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -387,6 +387,14 @@ abstract class RDD[T: ClassTag](
   preservesPartitioning = true)
   }
 
+  /**
+    * Return a new RDD containing only the elements that satisfy a predicate.
--- End diff --

Hi, @reggert . 
Could you fix the indentation?


---




[GitHub] spark pull request #22089: [SPARK-25098][SQL]‘Cast’ will return NULL whe...

2018-11-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22089


---




[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...

2018-11-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22943


---




[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22943
  
Thank you, @wangyum and @cloud-fan .
Merged to master.


---




[GitHub] spark pull request #19796: [SPARK-22581][SQL] Catalog api does not allow to ...

2018-11-06 Thread timvw
Github user timvw commented on a diff in the pull request:

https://github.com/apache/spark/pull/19796#discussion_r231382828
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala ---
@@ -411,7 +410,29 @@ abstract class Catalog {
       tableName: String,
       source: String,
       schema: StructType,
-      options: Map[String, String]): DataFrame
+      options: Map[String, String]): DataFrame = {
+    createTable(tableName, source, schema, options, Nil)
+  }
+
+  /**
+    * :: Experimental ::
+    * (Scala-specific)
+    * Create a table based on the dataset in a data source, a schema, a set of options
+    * and a set of partition columns. Then, returns the corresponding DataFrame.
+    *
+    * @param tableName is either a qualified or unqualified name that designates a table.
+    *                  If no database identifier is provided, it refers to a table in
+    *                  the current database.
+    * @since ???
+    */
+  @Experimental
+  @InterfaceStability.Evolving
+  def createTable(
+      tableName: String,
+      source: String,
+      schema: StructType,
+      options: Map[String, String],
+      partitionColumnNames : Seq[String]): DataFrame
--- End diff --

Imho, having an API without options to specify partitioning in a big-data 
context is just pointless.
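
For reference, a usage sketch of the proposed overload (table and column 
names are hypothetical):

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val schema = new StructType().add("id", LongType).add("date", StringType)

// The trailing argument carries the partition columns, which the existing
// createTable overloads cannot express.
spark.catalog.createTable("events", "parquet", schema,
  Map.empty[String, String], Seq("date"))
```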


---




[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22943#discussion_r231382309
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ---
@@ -140,16 +140,10 @@ class DateTimeUtilsSuite extends SparkFunSuite {
 c = Calendar.getInstance()
 c.set(2015, 2, 18, 0, 0, 0)
 c.set(Calendar.MILLISECOND, 0)
-assert(stringToDate(UTF8String.fromString("2015-03-18")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 ")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 123142")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T123123")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T")).get ===
-  millisToDays(c.getTimeInMillis))
+Seq("2015-03-18", "2015-03-18 ", " 2015-03-18", " 2015-03-18 ", 
"2015-03-18 123142",
--- End diff --

ah i see


---




[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22932
  
Could you review this, @gatorsmile ?


---




[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22943#discussion_r231381218
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ---
@@ -140,16 +140,10 @@ class DateTimeUtilsSuite extends SparkFunSuite {
 c = Calendar.getInstance()
 c.set(2015, 2, 18, 0, 0, 0)
 c.set(Calendar.MILLISECOND, 0)
-assert(stringToDate(UTF8String.fromString("2015-03-18")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 ")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 123142")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T123123")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T")).get ===
-  millisToDays(c.getTimeInMillis))
+Seq("2015-03-18", "2015-03-18 ", " 2015-03-18", " 2015-03-18 ", 
"2015-03-18 123142",
--- End diff --

New test cases (with space padding) are added; e.g. ` 2015-03-18` and ` 
2015-03-18 `.


---




[GitHub] spark pull request #22960: [SPARK-25955][TEST] Porting JSON tests for CSV fu...

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22960#discussion_r231380992
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala ---
@@ -86,4 +86,82 @@ class CsvFunctionsSuite extends QueryTest with SharedSQLContext {
 
     checkAnswer(df.select(to_csv($"a", options)), Row("26/08/2015 18:00") :: Nil)
   }
+
+  test("from_csv uses DDL strings for defining a schema - java") {
+    val df = Seq("""1,"haa"""").toDS()
+    checkAnswer(
+      df.select(
+        from_csv($"value", lit("a INT, b STRING"), new java.util.HashMap[String, String]())),
--- End diff --

The only difference from the JSON version is `from_csv` vs. `from_json`.


---




[GitHub] spark pull request #22943: [SPARK-25098][SQL] Trim the string when cast stri...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22943#discussion_r231380552
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ---
@@ -140,16 +140,10 @@ class DateTimeUtilsSuite extends SparkFunSuite {
 c = Calendar.getInstance()
 c.set(2015, 2, 18, 0, 0, 0)
 c.set(Calendar.MILLISECOND, 0)
-assert(stringToDate(UTF8String.fromString("2015-03-18")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 ")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18 123142")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T123123")).get ===
-  millisToDays(c.getTimeInMillis))
-assert(stringToDate(UTF8String.fromString("2015-03-18T")).get ===
-  millisToDays(c.getTimeInMillis))
+Seq("2015-03-18", "2015-03-18 ", " 2015-03-18", " 2015-03-18 ", 
"2015-03-18 123142",
--- End diff --

the test result doesn't change?


---




[GitHub] spark pull request #22952: [SPARK-20568][SS] Rename files which are complete...

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22952#discussion_r231378889
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -530,6 +530,8 @@ Here are the details of all the sources in Spark.
 "s3://a/dataset.txt"
 "s3n://a/b/dataset.txt"
 "s3a://a/b/c/dataset.txt"
+
+renameCompletedFiles: whether to rename completed files in previous batch (default: false). If the option is enabled, input file will be renamed with additional postfix "_COMPLETED_". This is useful to clean up old input files to save space in storage.
--- End diff --

Hi, @HeartSaVioR .
Renaming is expensive in S3, isn't it? I don't worry about HDFS, but do you 
know if there are potential side effects, like performance degradation, in a 
cloud environment, especially with continuous processing mode?
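
For reference, using the option under review would look roughly like this 
(the option name comes from the diff above; the path is hypothetical):

```scala
// Opt-in renaming of already-processed input files, per the proposed docs above.
val stream = spark.readStream
  .format("text")
  .option("renameCompletedFiles", "true")
  .load("s3a://bucket/incoming")
```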


---




[GitHub] spark issue #22951: [SPARK-25945][SQL] Support locale while parsing date/tim...

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22951
  
Could you rebase this once again, @MaxGekk ?


---




[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22943
  
Could you review this, @gatorsmile and @cloud-fan ?


---




[GitHub] spark issue #22867: [SPARK-25778] WriteAheadLogBackedBlockRDD in YARN Cluste...

2018-11-06 Thread gss2002
Github user gss2002 commented on the issue:

https://github.com/apache/spark/pull/22867
  
@vanzin you are right! I appreciate the help with this one. I will cut a 
patch in the AM, after testing on a large-scale cluster job that reads from 
IBM MQ, ETLs the data, and ships it off to Kafka.

But this looks to work:

    val nonExistentDirectory = new File(
      System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString).toURI.toString


---




[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22943
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98540/
Test PASSed.


---




[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22943
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22943
  
**[Test build #98540 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98540/testReport)**
 for PR 22943 at commit 
[`b866d65`](https://github.com/apache/spark/commit/b866d65c534d016f814946236b55ff05f79a4490).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22921
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98537/
Test FAILed.


---




[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22921
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22921
  
**[Test build #98537 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98537/testReport)**
 for PR 22921 at commit 
[`af748d5`](https://github.com/apache/spark/commit/af748d5a2680ffbea859f186cf48c97e1d700ee5).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22951: [SPARK-25945][SQL] Support locale while parsing date/tim...

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22951
  
Looks good. I or someone else should take a closer look before getting this 
in.


---




[GitHub] spark pull request #22956: [SPARK-25950][SQL] from_csv should respect to spa...

2018-11-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22956


---




[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22932
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98539/
Test PASSed.


---




[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22932
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22932: [SPARK-25102][SQL] Write Spark version to ORC/Parquet fi...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22932
  
**[Test build #98539 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98539/testReport)**
 for PR 22932 at commit 
[`ef49a27`](https://github.com/apache/spark/commit/ef49a277d3fd39c6fd91b3fcda65f660b833ec95).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22956: [SPARK-25950][SQL] from_csv should respect to spark.sql....

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22956
  
Merged to master.


---




[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22921
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22921
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98536/
Test FAILed.


---




[GitHub] spark pull request #22956: [SPARK-25950][SQL] from_csv should respect to spa...

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22956#discussion_r231370599
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala ---
@@ -92,8 +93,14 @@ case class CsvToStructs(
     }
   }
 
+  val nameOfCorruptRecord = SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD)
--- End diff --

Yea, I think so.


---




[GitHub] spark issue #22921: [SPARK-25908][CORE][SQL] Remove old deprecated items in ...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22921
  
**[Test build #98536 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98536/testReport)**
 for PR 22921 at commit 
[`6bcbf79`](https://github.com/apache/spark/commit/6bcbf79a14866c2d6e11bfa7b89a095584cb8228).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send o...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22275
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98538/
Test PASSed.


---




[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send o...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22275
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22275: [SPARK-25274][PYTHON][SQL] In toPandas with Arrow send o...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22275
  
**[Test build #98538 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98538/testReport)**
 for PR 22275 at commit 
[`bf2feec`](https://github.com/apache/spark/commit/bf2feec2ef023177d72ac1137dbd1b3a02eb9a89).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22937: [SPARK-25934] [Mesos] Don't propagate SPARK_CONF_DIR fro...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22937
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98535/
Test PASSed.


---




[GitHub] spark issue #22937: [SPARK-25934] [Mesos] Don't propagate SPARK_CONF_DIR fro...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22937
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22937: [SPARK-25934] [Mesos] Don't propagate SPARK_CONF_DIR fro...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22937
  
**[Test build #98535 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98535/testReport)**
 for PR 22937 at commit 
[`b500199`](https://github.com/apache/spark/commit/b50019987da954956e407c55e56a4329f8e5633f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22617: [SPARK-25484][SQL][TEST] Refactor ExternalAppendOnlyUnsa...

2018-11-06 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22617
  
Retest this please


---




[GitHub] spark pull request #22911: [SPARK-25815][k8s] Support kerberos in client mod...

2018-11-06 Thread ifilonenko
Github user ifilonenko commented on a diff in the pull request:

https://github.com/apache/spark/pull/22911#discussion_r231359962
  
--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala ---
@@ -123,7 +126,11 @@ private[spark] class KubernetesClusterSchedulerBackend(
   }
 
   override def createDriverEndpoint(properties: Seq[(String, String)]): DriverEndpoint = {
-    new KubernetesDriverEndpoint(rpcEnv, properties)
+    new KubernetesDriverEndpoint(sc.env.rpcEnv, properties)
+  }
+
+  override protected def createTokenManager(): Option[HadoopDelegationTokenManager] = {
+    Some(new HadoopDelegationTokenManager(conf, sc.hadoopConfiguration))
--- End diff --

Yeah, I can always throw up a follow-up for that. No worries


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r231359624
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

Ok. I will try to make a PR and see if we can have a better fix for this. 
Thanks for the suggestion.


---




[GitHub] spark issue #22956: [SPARK-25950][SQL] from_csv should respect to spark.sql....

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22956
  
LGTM


---




[GitHub] spark pull request #22956: [SPARK-25950][SQL] from_csv should respect to spa...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22956#discussion_r231359024
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala ---
@@ -92,8 +93,14 @@ case class CsvToStructs(
     }
   }
 
+  val nameOfCorruptRecord = SQLConf.get.getConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD)
--- End diff --

should this be private?


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r231358749
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

I wouldn't special-case primitive types when this is a general problem.


---




[GitHub] spark issue #22165: [SPARK-25017][Core] Add test suite for ContextBarrierSta...

2018-11-06 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/22165
  
gentle ping @jiangxb1987 


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r231358690
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

I wouldn't special-case primitive types when this is a general problem.


---




[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22961
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22961
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22961: [SPARK-25947][SQL] Reduce memory usage in ShuffleExchang...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22961
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #22961: [SPARK-25947][SQL] Reduce memory usage in Shuffle...

2018-11-06 Thread mu5358271
GitHub user mu5358271 opened a pull request:

https://github.com/apache/spark/pull/22961

[SPARK-25947][SQL] Reduce memory usage in ShuffleExchangeExec by selecting 
only the sort columns

## What changes were proposed in this pull request?

When sorting rows, ShuffleExchangeExec uses the entire row instead of just 
the columns referenced in the SortOrder to create the RangePartitioner. This 
causes the RangePartitioner to sample entire rows to create rangeBounds, which 
can cause OOM issues on the driver when rows contain large fields.

This change creates a projection and only uses the columns involved in the 
SortOrder for the RangePartitioner; a sketch of the idea follows.
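
A sketch of the approach (illustrative names, not the exact patch):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, SortOrder, UnsafeProjection}

// Sample only the sort-key columns when computing range bounds, so large
// non-key fields never reach the driver.
def sortKeyOnlyRdd(
    rdd: RDD[InternalRow],
    sortOrder: Seq[SortOrder],
    output: Seq[Attribute]): RDD[InternalRow] = {
  rdd.mapPartitions { iter =>
    val projection = UnsafeProjection.create(sortOrder.map(_.child), output)
    iter.map(row => projection(row).copy())
  }
}
```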

## How was this patch tested?

started a local spark-shell with a small spark.driver.maxResultSize:

```
spark-shell --master 'local[16]' --conf spark.driver.maxResultSize=128M 
--driver-memory 4g
```

and ran the following script:

```
import com.google.common.io.Files
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

import scala.util.Random

@transient val sc = SparkContext.getOrCreate()
@transient val spark = SparkSession.builder().getOrCreate()

import spark.implicits._

val path = Files.createTempDir().toString

// this creates a dataset with 1024 entries, each 1MB in size, across 16 
partitions
sc.parallelize(0 until (1 << 10), sc.defaultParallelism).
  map(_ => Array.fill(1 << 18)(Random.nextInt)).
  toDS.
  write.mode("overwrite").parquet(path)

spark.read.parquet(path).
  orderBy('value (0)).
  write.mode("overwrite").parquet(s"$path-sorted")

spark.read.parquet(s"$path-sorted").show
```
Without this change, execution fails when initializing the RangePartitioner. 
With this change, execution succeeds and generates a correctly sorted dataset.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mu5358271/spark sort-improvement

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22961.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22961


commit 61288d40475a4145561ea4be566bc63b78c25b5a
Author: shuhengd 
Date:   2018-11-06T04:23:18Z

[SPARK-25947][SQL] Reduce memory usage in ShuffleExchangeExec by selecting 
only the sort columns




---




[GitHub] spark issue #22855: [SPARK-25839] [Core] Implement use of KryoPool in KryoSe...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22855
  
**[Test build #4417 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4417/testReport)**
 for PR 22855 at commit 
[`60310c0`](https://github.com/apache/spark/commit/60310c0e18613f0c32f19b73e6ac25a49ba25e86).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22944: [SPARK-25942][SQL] Fix Dataset.groupByKey to make...

2018-11-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22944#discussion_r231350156
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala ---
@@ -262,25 +262,39 @@ object AppendColumns {
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
+}
 new AppendColumns(
   func.asInstanceOf[Any => Any],
   implicitly[Encoder[T]].clsTag.runtimeClass,
   implicitly[Encoder[T]].schema,
   UnresolvedDeserializer(encoderFor[T].deserializer),
-  encoderFor[U].namedExpressions,
+  namedExpressions,
   child)
   }
 
   def apply[T : Encoder, U : Encoder](
   func: T => U,
   inputAttributes: Seq[Attribute],
   child: LogicalPlan): AppendColumns = {
+val outputEncoder = encoderFor[U]
+val namedExpressions = if (!outputEncoder.isSerializedAsStruct) {
+  assert(outputEncoder.namedExpressions.length == 1)
+  outputEncoder.namedExpressions.map(Alias(_, "key")())
+} else {
+  outputEncoder.namedExpressions
--- End diff --

Thanks, I see. For this primitive-type case, is the current fix OK? Or 
should we deal with case classes together?
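
A hedged sketch of the resulting behavior (spark-shell style; the schema shown
is illustrative of the fix aliasing the single named expression to "key"):
```scala
import spark.implicits._  // assumes a spark-shell session

// With a primitive grouping key, the key encoder is not serialized as a
// struct, so the fix aliases its one named expression to "key".
val grouped = Seq(1, 2, 3).toDS().groupByKey(_ % 2).count()
grouped.printSchema()
// root
//  |-- key: integer (nullable = false)
//  |-- count(1): long (nullable = false)
```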


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22926: [SPARK-25917][Spark UI] memoryMetrics should be Json ign...

2018-11-06 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/22926
  
Is this a problem in master at all?

The data is serialized with `JacksonMessageWriter`, which seems to be 
configured properly:

```
private[v1] class JacksonMessageWriter extends MessageBodyWriter[Object]{
  ...
  mapper.setSerializationInclusion(JsonInclude.Include.NON_ABSENT)
```

An easy way to answer that question is to write a unit test.
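
A minimal sketch of such a test (assuming the Jackson Scala module on the
classpath; the case class and expected output are illustrative, not actual
Spark test code):
```scala
import com.fasterxml.jackson.annotation.JsonInclude
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class Metrics(used: Long, memoryMetrics: Option[String])

object NonAbsentCheck extends App {
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  mapper.setSerializationInclusion(JsonInclude.Include.NON_ABSENT)
  // NON_ABSENT drops Option fields that are None, so no "memoryMetrics"
  // key should appear in the serialized output.
  assert(mapper.writeValueAsString(Metrics(1L, None)) == """{"used":1}""")
}
```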


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22590
  
I wonder how important it is. I know `spark-csv` at Databricks supported 
different quote modes, and that support was lost when we ported it into Spark - 
the root cause was replacing the underlying library from Apache Commons CSV 
with Univocity.

After a few years, I have only seen one request to revive the quote mode 
proposed here - so I doubt how important it is.

Basically, @MaxGekk described my stance correctly. Can we investigate a way 
to set arbitrary parser settings options?
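
As a hedged sketch, this is what the requested behavior maps to in Univocity,
the parser library the CSV source delegates to (standalone usage for
illustration, not Spark's internal wiring):
```scala
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

val settings = new CsvParserSettings()
settings.setKeepQuotes(true)  // keep the enclosing quotes in parsed values

val parser = new CsvParser(settings)
val row = parser.parseLine("1,\"abc\"")
println(row.mkString("|"))  // 1|"abc"
```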


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22911: [SPARK-25815][k8s] Support kerberos in client mod...

2018-11-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22911#discussion_r231348306
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala
 ---
@@ -123,7 +126,11 @@ private[spark] class KubernetesClusterSchedulerBackend(
   }
 
   override def createDriverEndpoint(properties: Seq[(String, String)]): 
DriverEndpoint = {
-new KubernetesDriverEndpoint(rpcEnv, properties)
+new KubernetesDriverEndpoint(sc.env.rpcEnv, properties)
+  }
+
+  override protected def createTokenManager(): 
Option[HadoopDelegationTokenManager] = {
+Some(new HadoopDelegationTokenManager(conf, sc.hadoopConfiguration))
--- End diff --

Ah, ok I get it now. I can do that. I'll try to include support for (3) but 
it depends on how much I have to touch other parts of the code. Hopefully not 
much.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22590: [SPARK-25574][SQL]Add an option `keepQuotes` for parsing...

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22590
  
They should be documented in the API docs, as in `DataFrameReader.scala`. For 
the site, we should avoid doc duplication - how to document options is a 
general issue.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22504: [SPARK-25118][Submit] Persist Driver Logs in Clie...

2018-11-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22504#discussion_r231346067
  
--- Diff: docs/configuration.md ---
@@ -266,6 +266,40 @@ of the most common options to set are:
 Only has effect in Spark standalone mode or Mesos cluster deploy mode.
   
 
+
+  spark.driver.log.dfsDir
+  (none)
+  
+Base directory in which Spark driver logs are synced, if 
spark.driver.log.persistToDfs.enabled
+is true. Within this base directory, each application logs the driver 
logs to an application specific file.
+Users may want to set this to a unified location like an HDFS 
directory so driver log files can be persisted
+for later usage. This directory should allow any Spark user to 
read/write files and the Spark History Server
+user to delete files. Additionally, older logs from this directory are 
cleaned by
+ 
Spark History Server  if
--- End diff --

remove space after `>`
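
For reference, a hedged sketch of how these options fit together on the
application side, expressed via `SparkConf` (the HDFS path is illustrative):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.log.persistToDfs.enabled", "true")
  .set("spark.driver.log.dfsDir", "hdfs:///user/spark/driverLogs")
// The History Server side separately enables
// spark.history.fs.driverlog.cleaner.enabled to clean up old logs.
```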


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22504: [SPARK-25118][Submit] Persist Driver Logs in Clie...

2018-11-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22504#discussion_r231346390
  
--- Diff: docs/configuration.md ---
@@ -266,6 +266,40 @@ of the most common options to set are:
 Only has effect in Spark standalone mode or Mesos cluster deploy mode.
   
 
+
+  spark.driver.log.dfsDir
+  (none)
+  
+Base directory in which Spark driver logs are synced, if 
spark.driver.log.persistToDfs.enabled
+is true. Within this base directory, each application logs the driver 
logs to an application specific file.
+Users may want to set this to a unified location like an HDFS 
directory so driver log files can be persisted
+for later usage. This directory should allow any Spark user to 
read/write files and the Spark History Server
+user to delete files. Additionally, older logs from this directory are 
cleaned by
+ 
Spark History Server  if
+spark.history.fs.driverlog.cleaner.enabled is true and, 
if they are older than max age configured
+at spark.history.fs.driverlog.cleaner.maxAge.
+  
+
+
+  spark.driver.log.persistToDfs.enabled
+  false
+  
+If true, spark application running in client mode will write driver 
logs to a persistent storage, configured
+in spark.driver.log.dfsDir. If 
spark.driver.log.dfsDir is not configured, driver logs
+will not be persisted. Additionally, enable the cleaner by setting 
spark.history.fs.driverlog.cleaner.enabled
+to true in  Spark 
History Server.
--- End diff --

no space after `>`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22504: [SPARK-25118][Submit] Persist Driver Logs in Clie...

2018-11-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22504#discussion_r231346161
  
--- Diff: docs/configuration.md ---
@@ -266,6 +266,40 @@ of the most common options to set are:
 Only has effect in Spark standalone mode or Mesos cluster deploy mode.
   
 
+
+  spark.driver.log.dfsDir
+  (none)
+  
+Base directory in which Spark driver logs are synced, if 
spark.driver.log.persistToDfs.enabled
+is true. Within this base directory, each application logs the driver 
logs to an application specific file.
+Users may want to set this to a unified location like an HDFS 
directory so driver log files can be persisted
+for later usage. This directory should allow any Spark user to 
read/write files and the Spark History Server
+user to delete files. Additionally, older logs from this directory are 
cleaned by
+ 
Spark History Server  if
+spark.history.fs.driverlog.cleaner.enabled is true and, 
if they are older than max age configured
+at spark.history.fs.driverlog.cleaner.maxAge.
--- End diff --

s/at/by setting


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22504: [SPARK-25118][Submit] Persist Driver Logs in Clie...

2018-11-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22504#discussion_r231346117
  
--- Diff: docs/configuration.md ---
@@ -266,6 +266,40 @@ of the most common options to set are:
 Only has effect in Spark standalone mode or Mesos cluster deploy mode.
   
 
+
+  spark.driver.log.dfsDir
+  (none)
+  
+Base directory in which Spark driver logs are synced, if 
spark.driver.log.persistToDfs.enabled
+is true. Within this base directory, each application logs the driver 
logs to an application specific file.
+Users may want to set this to a unified location like an HDFS 
directory so driver log files can be persisted
+for later usage. This directory should allow any Spark user to 
read/write files and the Spark History Server
+user to delete files. Additionally, older logs from this directory are 
cleaned by
--- End diff --

...cleaned by the...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22504: [SPARK-25118][Submit] Persist Driver Logs in Clie...

2018-11-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22504#discussion_r231346507
  
--- Diff: docs/configuration.md ---
@@ -266,6 +266,40 @@ of the most common options to set are:
 Only has effect in Spark standalone mode or Mesos cluster deploy mode.
   
 
+
+  spark.driver.log.dfsDir
+  (none)
+  
+Base directory in which Spark driver logs are synced, if 
spark.driver.log.persistToDfs.enabled
+is true. Within this base directory, each application logs the driver 
logs to an application specific file.
+Users may want to set this to a unified location like an HDFS 
directory so driver log files can be persisted
+for later usage. This directory should allow any Spark user to 
read/write files and the Spark History Server
+user to delete files. Additionally, older logs from this directory are 
cleaned by
+ 
Spark History Server  if
+spark.history.fs.driverlog.cleaner.enabled is true and, 
if they are older than max age configured
+at spark.history.fs.driverlog.cleaner.maxAge.
+  
+
+
+  spark.driver.log.persistToDfs.enabled
+  false
+  
+If true, spark application running in client mode will write driver 
logs to a persistent storage, configured
+in spark.driver.log.dfsDir. If 
spark.driver.log.dfsDir is not configured, driver logs
+will not be persisted. Additionally, enable the cleaner by setting 
spark.history.fs.driverlog.cleaner.enabled
+to true in  Spark 
History Server.
+  
+
+
+  spark.driver.log.layout
+  %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
+  
+The layout for the driver logs that are synced to 
spark.driver.log.dfsDir. If 
+spark.driver.log.persistToDfs.enabled is true and this 
configuration is used. If this is not configured,
--- End diff --

No need to mention the `enabled` option here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22504: [SPARK-25118][Submit] Persist Driver Logs in Clie...

2018-11-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22504#discussion_r231346593
  
--- Diff: docs/monitoring.md ---
@@ -202,6 +202,28 @@ Security options for the Spark History Server are 
covered more detail in the
   applications that fail to rename their event logs listed as 
in-progress.
 
   
+  
+spark.history.fs.driverlog.cleaner.enabled
+spark.history.fs.cleaner.enabled
+
+  Specifies whether the History Server should periodically clean up 
driver logs from storage.
+
+  
+  
+spark.history.fs.driverlog.cleaner.interval
+spark.history.fs.cleaner.interval
+
+  How often the filesystem driver log history cleaner checks for files 
to delete.
--- End diff --

driver log cleaner


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22954: [DO-NOT-MERGE][POC] Enables Arrow optimization from R Da...

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22954
  
So far, the regression tests pass, and the newly added test for the R 
optimization is verified locally. Let me fix the CRAN test and some nits.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22956: [SPARK-25950][SQL] from_csv should respect to spark.sql....

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22956
  
Looks good. I or someone else should take a closer look before getting this 
in.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22911: [SPARK-25815][k8s] Support kerberos in client mod...

2018-11-06 Thread ifilonenko
Github user ifilonenko commented on a diff in the pull request:

https://github.com/apache/spark/pull/22911#discussion_r231344398
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala
 ---
@@ -123,7 +126,11 @@ private[spark] class KubernetesClusterSchedulerBackend(
   }
 
   override def createDriverEndpoint(properties: Seq[(String, String)]): 
DriverEndpoint = {
-new KubernetesDriverEndpoint(rpcEnv, properties)
+new KubernetesDriverEndpoint(sc.env.rpcEnv, properties)
+  }
+
+  override protected def createTokenManager(): 
Option[HadoopDelegationTokenManager] = {
+Some(new HadoopDelegationTokenManager(conf, sc.hadoopConfiguration))
--- End diff --

Oh, I was referencing the creation of the delegation-token secret when a 
`--keytab` is specified. I believe you are right that in client mode you would 
not need to worry about running this step. But I think the 3rd option would be 
good to include here. With the introduction of `HadoopDelegationTokenManager`, 
we should remove the creation of the `dtSecret`, and that should be included in 
this PR if you are introducing this. Therefore, I think it is sensible to 
refactor `KerberosConfigSpec` to have a generic `secret`, `secretName`, 
`secretItemKey` that would contain either a `DelegationToken` or a `keytab`, 
such that the code block:
```
  private val kerberosConfSpec: Option[KerberosConfigSpec] = (for {
secretName <- existingSecretName
secretItemKey <- existingSecretItemKey
  } yield {
KerberosConfigSpec(
  secret = None,
  secretName = secretName,
  secretItemKey = secretItemKey,
  jobUserName = kubeTokenManager.getCurrentUser.getShortUserName)
  }).orElse(
if (isKerberosEnabled) {
  keytab.map { . }
} else {
  None
}
```
would return a kerberosConfSpec that accounts for either case. That would also 
mean you could delete the `HadoopKerberosLogin` method.
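
A hedged sketch of the generic spec being suggested (field types are
assumptions based on this discussion, not the actual code):
```scala
import io.fabric8.kubernetes.api.model.Secret

// One spec covers both cases: `secret` is defined when a new secret is
// created (delegation token or keytab); otherwise an existing secret is
// referenced by name and item key.
private[spark] case class KerberosConfigSpec(
    secret: Option[Secret],
    secretName: String,
    secretItemKey: String,
    jobUserName: String)
```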


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22960: [SPARK-25955][TEST] Porting JSON tests for CSV fu...

2018-11-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22960#discussion_r231344120
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala ---
@@ -86,4 +86,82 @@ class CsvFunctionsSuite extends QueryTest with 
SharedSQLContext {
 
 checkAnswer(df.select(to_csv($"a", options)), Row("26/08/2015 18:00") 
:: Nil)
   }
+
+  test("from_csv uses DDL strings for defining a schema - java") {
+val df = Seq("""1,"haa"""").toDS()
+checkAnswer(
+  df.select(
+from_csv($"value", lit("a INT, b STRING"), new 
java.util.HashMap[String, String]())),
+  Row(Row(1, "haa")) :: Nil)
+  }
+
+  test("roundtrip to_csv -> from_csv") {
+val df = Seq(Tuple1(Tuple1(1)), Tuple1(null)).toDF("struct")
+val schema = df.schema(0).dataType.asInstanceOf[StructType]
+val options = Map.empty[String, String]
+val readback = df.select(to_csv($"struct").as("csv"))
+  .select(from_csv($"csv", schema, options).as("struct"))
+
+checkAnswer(df, readback)
+  }
+
+  test("roundtrip from_csv -> to_csv") {
+val df = Seq(Some("1"), None).toDF("csv")
+val schema = new StructType().add("a", IntegerType)
+val options = Map.empty[String, String]
+val readback = df.select(from_csv($"csv", schema, 
options).as("struct"))
+  .select(to_csv($"struct").as("csv"))
+
+checkAnswer(df, readback)
+  }
+
+  test("infers schemas of a CSV string and pass to to from_csv") {
+val in = Seq("""0.123456789,987654321,"San Francisco"""").toDS()
+val options = Map.empty[String, String].asJava
+val out = in.select(from_csv('value, schema_of_csv("0.1,1,a"), 
options) as "parsed")
+val expected = StructType(Seq(StructField(
+  "parsed",
+  StructType(Seq(
+StructField("_c0", DoubleType, true),
+StructField("_c1", IntegerType, true),
+StructField("_c2", StringType, true))
+
+assert(out.schema == expected)
+  }
+
+  test("Support to_csv in SQL") {
--- End diff --

@MaxGekk, wouldn't the tests in `csv-functions.sql` be enough for SQL 
support test?
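
A hedged sketch of the kind of coverage those golden-file tests give, expressed
as equivalent `spark.sql` calls (the queries are illustrative, not the actual
`csv-functions.sql` contents):
```scala
// Each statement in csv-functions.sql is executed and its output compared
// against a checked-in golden file; roughly equivalent to (spark-shell):
spark.sql("SELECT from_csv('1, 3.14', 'a INT, f FLOAT')").show()
spark.sql("SELECT to_csv(named_struct('a', 1, 'b', 2))").show()
```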


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22894: [SPARK-25885][Core][Minor] HighlyCompressedMapStatus des...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22894
  
**[Test build #4416 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4416/testReport)** for PR 22894 at commit [`57bdd75`](https://github.com/apache/spark/commit/57bdd7525f3353a6d59772b2a86abbe6a0d5f4ba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22909: [SPARK-25897][k8s] Hook up k8s integration tests to sbt ...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22909
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22909: [SPARK-25897][k8s] Hook up k8s integration tests to sbt ...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22909
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98532/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22943
  
**[Test build #98540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98540/testReport)** for PR 22943 at commit [`b866d65`](https://github.com/apache/spark/commit/b866d65c534d016f814946236b55ff05f79a4490).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22909: [SPARK-25897][k8s] Hook up k8s integration tests to sbt ...

2018-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22909
  
**[Test build #98532 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98532/testReport)** for PR 22909 at commit [`f07f50c`](https://github.com/apache/spark/commit/f07f50c4e495eb25f92a930e424a579da68c5be6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22943
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4808/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22943: [SPARK-25098][SQL] Trim the string when cast stringToTim...

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22943
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22926: [SPARK-25917][Spark UI] memoryMetrics should be Json ign...

2018-11-06 Thread jianjianjiao
Github user jianjianjiao commented on the issue:

https://github.com/apache/spark/pull/22926
  
@AmplabJenkins  Could you please find someone to review this? I believe 
this is a bug in Spark UI. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22960: [SPARK-25955][TEST] Porting JSON tests for CSV functions

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22960
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22960: [SPARK-25955][TEST] Porting JSON tests for CSV functions

2018-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22960
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98531/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


