date:20180520

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21370
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21370
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90871/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21370
  
**[Test build #90871 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90871/testReport)**
 for PR 21370 at commit 
[`ebc0b11`](https://github.com/apache/spark/commit/ebc0b11fd006386d32949f56228e2671297373fc).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21370
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90872/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21370
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21370
  
**[Test build #90872 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90872/testReport)**
 for PR 21370 at commit 
[`f2bb8f3`](https://github.com/apache/spark/commit/f2bb8f334631734869ddf5d8ef1eca1fa29d334a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21316: [SPARK-20538][SQL] Wrap Dataset.reduce with withN...

2018-05-20 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21316#discussion_r189496880
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1607,7 +1607,9 @@ class Dataset[T] private[sql](
*/
   @Experimental
   @InterfaceStability.Evolving
-  def reduce(func: (T, T) => T): T = rdd.reduce(func)
+  def reduce(func: (T, T) => T): T = withNewExecutionId {
--- End diff --

@rxin Sorry, I missed that comment. I just thought what `reduce` can do is 
what type-safe `groupByKey` can do. But, probably I missed something.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189493817
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.jupyter.eagerEval.enabled
+  false
+  
+Open eager evaluation on jupyter or not. If yes, dataframe will be ran 
automatically
+and html table will feedback the queries user have defined (see
+https://issues.apache.org/jira/browse/SPARK-24215;>SPARK-24215 for 
more details).
+  
+
+
+  spark.jupyter.default.showRows
--- End diff --

`spark.jupyter.eagerEval.showRows` or something?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189493455
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -358,6 +357,43 @@ class Dataset[T] private[sql](
 sb.toString()
   }
 
+  /**
+   * Transform current row string and append to builder
+   *
+   * @param row   Current row of string
+   * @param truncate  If set to more than 0, truncates strings to 
`truncate` characters and
+   *all cells will be aligned right.
+   * @param colWidths The width of each column
+   * @param html  If set to true, return output as html table.
+   * @param head  Set to true while current row is table head.
+   * @param sbStringBuilder for current row.
+   */
+  private[sql] def appendRowString(
+  row: Seq[String],
+  truncate: Int,
+  colWidths: Array[Int],
+  html: Boolean,
+  head: Boolean,
+  sb: StringBuilder): Unit = {
+val data = row.zipWithIndex.map { case (cell, i) =>
+  if (truncate > 0) {
+StringUtils.leftPad(cell, colWidths(i))
+  } else {
+StringUtils.rightPad(cell, colWidths(i))
+  }
+}
+(html, head) match {
+  case (true, true) =>
+data.map(StringEscapeUtils.escapeHtml).addString(
+  sb, "", "", "")
--- End diff --

nit: add `\n`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189495783
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -78,6 +78,12 @@ def __init__(self, jdf, sql_ctx):
 self.is_cached = False
 self._schema = None  # initialized lazily
 self._lazy_rdd = None
+self._eager_eval = sql_ctx.getConf(
+"spark.jupyter.eagerEval.enabled", "false").lower() == "true"
+self._default_console_row = int(sql_ctx.getConf(
+"spark.jupyter.default.showRows", u"20"))
+self._default_console_truncate = int(sql_ctx.getConf(
+"spark.jupyter.default.showRows", u"20"))
--- End diff --

`spark.jupyter.default.truncate`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189493218
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -292,31 +297,25 @@ class Dataset[T] private[sql](
   }
 
   // Create SeparateLine
-  val sep: String = colWidths.map("-" * _).addString(sb, "+", "+", 
"+\n").toString()
+  val sep: String = if (html) {
+// Initial append table label
+sb.append("\n")
+"\n"
+  } else {
+colWidths.map("-" * _).addString(sb, "+", "+", "+\n").toString()
+  }
 
   // column names
-  rows.head.zipWithIndex.map { case (cell, i) =>
-if (truncate > 0) {
-  StringUtils.leftPad(cell, colWidths(i))
-} else {
-  StringUtils.rightPad(cell, colWidths(i))
-}
-  }.addString(sb, "|", "|", "|\n")
-
+  appendRowString(rows.head, truncate, colWidths, html, true, sb)
   sb.append(sep)
 
   // data
-  rows.tail.foreach {
-_.zipWithIndex.map { case (cell, i) =>
-  if (truncate > 0) {
-StringUtils.leftPad(cell.toString, colWidths(i))
-  } else {
-StringUtils.rightPad(cell.toString, colWidths(i))
-  }
-}.addString(sb, "|", "|", "|\n")
+  rows.tail.foreach { row =>
+appendRowString(row.map(_.toString), truncate, colWidths, html, 
false, sb)
--- End diff --

We don't need `.map(_.toString)`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189494064
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +353,18 @@ def show(self, n=20, truncate=True, vertical=False):
  name | Bob
 """
 if isinstance(truncate, bool) and truncate:
-print(self._jdf.showString(n, 20, vertical))
+print(self._jdf.showString(n, 20, vertical, False))
 else:
-print(self._jdf.showString(n, int(truncate), vertical))
+print(self._jdf.showString(n, int(truncate), vertical, False))
 
 def __repr__(self):
 return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in 
self.dtypes))
 
+def _repr_html_(self):
+if self._eager_eval:
+return self._jdf.showString(
+self._default_console_row, self._default_console_truncate, 
False, True)
--- End diff --

What will be shown if `spark.jupyter.eagerEval.enabled` is `False`? 
Fallback the original automatically?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189495803
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -78,6 +78,12 @@ def __init__(self, jdf, sql_ctx):
 self.is_cached = False
 self._schema = None  # initialized lazily
 self._lazy_rdd = None
+self._eager_eval = sql_ctx.getConf(
+"spark.jupyter.eagerEval.enabled", "false").lower() == "true"
+self._default_console_row = int(sql_ctx.getConf(
+"spark.jupyter.default.showRows", u"20"))
+self._default_console_truncate = int(sql_ctx.getConf(
+"spark.jupyter.default.showRows", u"20"))
--- End diff --

I guess we shouldn't hold these three values but extract as `@property` or 
refer each time in `_repr_html_`. Otherwise, we'll hit unexpected behavior, 
e.g.:

```python
df = ...
spark.conf.set("spark.jupyter.eagerEval.enabled", True)
df
```

won't show the html.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189496454
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.jupyter.eagerEval.enabled
+  false
+  
+Open eager evaluation on jupyter or not. If yes, dataframe will be ran 
automatically
--- End diff --

`true` instead of `yes`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189493846
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.jupyter.eagerEval.enabled
+  false
+  
+Open eager evaluation on jupyter or not. If yes, dataframe will be ran 
automatically
+and html table will feedback the queries user have defined (see
+https://issues.apache.org/jira/browse/SPARK-24215;>SPARK-24215 for 
more details).
+  
+
+
+  spark.jupyter.default.showRows
+  20
+  
+Default number of rows in jupyter html table.
+  
+
+
+  spark.jupyter.default.truncate
--- End diff --

`spark.jupyter.eagerEval.truncate` or something?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189493461
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -358,6 +357,43 @@ class Dataset[T] private[sql](
 sb.toString()
   }
 
+  /**
+   * Transform current row string and append to builder
+   *
+   * @param row   Current row of string
+   * @param truncate  If set to more than 0, truncates strings to 
`truncate` characters and
+   *all cells will be aligned right.
+   * @param colWidths The width of each column
+   * @param html  If set to true, return output as html table.
+   * @param head  Set to true while current row is table head.
+   * @param sbStringBuilder for current row.
+   */
+  private[sql] def appendRowString(
+  row: Seq[String],
+  truncate: Int,
+  colWidths: Array[Int],
+  html: Boolean,
+  head: Boolean,
+  sb: StringBuilder): Unit = {
+val data = row.zipWithIndex.map { case (cell, i) =>
+  if (truncate > 0) {
+StringUtils.leftPad(cell, colWidths(i))
+  } else {
+StringUtils.rightPad(cell, colWidths(i))
+  }
+}
+(html, head) match {
+  case (true, true) =>
+data.map(StringEscapeUtils.escapeHtml).addString(
+  sb, "", "", "")
+  case (true, false) =>
+data.map(StringEscapeUtils.escapeHtml).addString(
+  sb, "", "", "")
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-20 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r189493854
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, 
ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, 
Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, 
StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined 
in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds 
to a
+ * SQL column, and a nested Parquet column corresponds to a 
[[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+if (SQLConf.get.nestedSchemaPruningEnabled) {
+  apply0(plan)
+} else {
+  plan
+}
+
+  private def apply0(plan: LogicalPlan): LogicalPlan =
+plan transformDown {
+  case op @ PhysicalOperation(projects, filters,
+  l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, 
partitionSchema,
+dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _, _)) 
=>
+val projectionFields = projects.flatMap(getFields)
+val filterFields = filters.flatMap(getFields)
+val requestedFields = (projectionFields ++ filterFields).distinct
+
+// If [[requestedFields]] includes a nested field, continue. 
Otherwise,
+// return [[op]]
+if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) 
{
+  val prunedSchema = requestedFields
+.map { case (field, _) => StructType(Array(field)) }
+.reduceLeft(_ merge _)
+  val dataSchemaFieldNames = dataSchema.fieldNames.toSet
+  val prunedDataSchema =
+StructType(prunedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name)))
+
+  // If the data schema is different from the pruned data schema, 
continue. Otherwise,
+  // return [[op]]. We effect this comparison by counting the 
number of "leaf" fields in
+  // each schemata, assuming the fields in [[prunedDataSchema]] 
are a subset of the fields
+  // in [[dataSchema]].
+  if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) {
+val prunedParquetRelation =
+  hadoopFsRelation.copy(dataSchema = 
prunedDataSchema)(hadoopFsRelation.sparkSession)
+
+// We need to replace the expression ids of the pruned 
relation output attributes
+// with the expression ids of the original relation output 
attributes so that
+// references to the original relation's output are not broken
+val outputIdMap = l.output.map(att => (att.name, 
att.exprId)).toMap
+val prunedRelationOutput =
+  prunedParquetRelation
+.schema
+.toAttributes
+.map {
+  case att if outputIdMap.contains(att.name) =>
+att.withExprId(outputIdMap(att.name))
+  case att => att
+}
+val prunedRelation =
+  l.copy(relation = prunedParquetRelation, output = 
prunedRelationOutput)
+

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-20 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r189493986
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, 
ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, 
Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, 
StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined 
in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds 
to a
+ * SQL column, and a nested Parquet column corresponds to a 
[[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+if (SQLConf.get.nestedSchemaPruningEnabled) {
+  apply0(plan)
+} else {
+  plan
+}
+
+  private def apply0(plan: LogicalPlan): LogicalPlan =
+plan transformDown {
+  case op @ PhysicalOperation(projects, filters,
+  l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, 
partitionSchema,
+dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _, _)) 
=>
+val projectionFields = projects.flatMap(getFields)
+val filterFields = filters.flatMap(getFields)
+val requestedFields = (projectionFields ++ filterFields).distinct
+
+// If [[requestedFields]] includes a nested field, continue. 
Otherwise,
+// return [[op]]
+if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) 
{
+  val prunedSchema = requestedFields
+.map { case (field, _) => StructType(Array(field)) }
+.reduceLeft(_ merge _)
+  val dataSchemaFieldNames = dataSchema.fieldNames.toSet
+  val prunedDataSchema =
+StructType(prunedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name)))
+
+  // If the data schema is different from the pruned data schema, 
continue. Otherwise,
+  // return [[op]]. We effect this comparison by counting the 
number of "leaf" fields in
+  // each schemata, assuming the fields in [[prunedDataSchema]] 
are a subset of the fields
+  // in [[dataSchema]].
+  if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) {
+val prunedParquetRelation =
+  hadoopFsRelation.copy(dataSchema = 
prunedDataSchema)(hadoopFsRelation.sparkSession)
+
+// We need to replace the expression ids of the pruned 
relation output attributes
+// with the expression ids of the original relation output 
attributes so that
+// references to the original relation's output are not broken
+val outputIdMap = l.output.map(att => (att.name, 
att.exprId)).toMap
+val prunedRelationOutput =
+  prunedParquetRelation
+.schema
+.toAttributes
+.map {
+  case att if outputIdMap.contains(att.name) =>
+att.withExprId(outputIdMap(att.name))
+  case att => att
+}
+val prunedRelation =
+  l.copy(relation = prunedParquetRelation, output = 
prunedRelationOutput)
+

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-20 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r189491559
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
 ---
@@ -879,6 +879,15 @@ class ParquetQuerySuite extends QueryTest with 
ParquetTest with SharedSQLContext
   }
 }
   }
+
+  test("select function over nested data") {
--- End diff --

Without this PR, this test still can pass, right? 

Could you submit a separate PR for these test coverage improvement? We 
really welcome the test coverage improvement PRs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-20 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r189491063
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala
 ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides 
some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

Can we do this in a separate PR?  No need to block this PR due to the 
discussion about this implementation. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-20 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r189489061
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -162,7 +162,9 @@ case class FilterExec(condition: Expression, child: 
SparkPlan)
 val generatedIsNotNullChecks = new Array[Boolean](notNullPreds.length)
 val generated = otherPreds.map { c =>
   val nullChecks = c.references.map { r =>
-val idx = notNullPreds.indexWhere { n => 
n.asInstanceOf[IsNotNull].child.semanticEquals(r)}
+val idx = notNullPreds.indexWhere { n =>
+  n.asInstanceOf[IsNotNull].child.references.contains(r)
--- End diff --

Is this change related?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-20 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r189489577
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala
 ---
@@ -99,27 +100,28 @@ trait ConstraintHelper {
   }
 
   /**
-   * Infer the Attribute-specific IsNotNull constraints from the null 
intolerant child expressions
-   * of constraints.
+   * Infer the Attribute and ExtractValue-specific IsNotNull constraints 
from the null intolerant
+   * child expressions of constraints.
*/
   private def inferIsNotNullConstraints(constraint: Expression): 
Seq[Expression] =
 constraint match {
   // When the root is IsNotNull, we can push IsNotNull through the 
child null intolerant
   // expressions
-  case IsNotNull(expr) => 
scanNullIntolerantAttribute(expr).map(IsNotNull(_))
+  case IsNotNull(expr) => 
scanNullIntolerantField(expr).map(IsNotNull(_))
   // Constraints always return true for all the inputs. That means, 
null will never be returned.
   // Thus, we can infer `IsNotNull(constraint)`, and also push 
IsNotNull through the child
   // null intolerant expressions.
-  case _ => scanNullIntolerantAttribute(constraint).map(IsNotNull(_))
+  case _ => scanNullIntolerantField(constraint).map(IsNotNull(_))
 }
 
   /**
-   * Recursively explores the expressions which are null intolerant and 
returns all attributes
-   * in these expressions.
+   * Recursively explores the expressions which are null intolerant and 
returns all attributes and
+   * complex type extractors in these expressions.
*/
-  private def scanNullIntolerantAttribute(expr: Expression): 
Seq[Attribute] = expr match {
+  private def scanNullIntolerantField(expr: Expression): Seq[Expression] = 
expr match {
+case ev: ExtractValue => Seq(ev)
--- End diff --

For this improvement, can we do it in a separate PR? The corresponding unit 
test case are needed in `InferFiltersFromConstraintsSuite` instead of 
`ParquetSchemaPruningSuite`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-20 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r189491217
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -286,7 +286,19 @@ case class FileSourceScanExec(
   } getOrElse {
 metadata
   }
-withOptPartitionCount
+val withOptColumnCount = relation.fileFormat match {
+  case columnar: ColumnarFileFormat =>
+SparkSession
+  .getActiveSession
+  .map { sparkSession =>
+val columnCount = columnar.columnCountForSchema(sparkSession, 
requiredSchema)
+withOptPartitionCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

This needs to be in a separate PR as I suggested above. BTW, we could 
easily lose this metadata if this change does not have a test case. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-20 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r189479383
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1256,8 +1256,18 @@ object SQLConf {
 "issues. Turn on this config to insert a local sort before 
actually doing repartition " +
 "to generate consistent repartition results. The performance of 
repartition() may go " +
 "down since we insert extra local sort before it.")
+.booleanConf
+.createWithDefault(true)
+
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+buildConf("spark.sql.nestedSchemaPruning.enabled")
+  .internal()
+  .doc("Prune nested fields from a logical relation's output which are 
unnecessary in " +
+"satisfying a query. This optimization allows columnar file format 
readers to avoid " +
+"reading unnecessary nested column data. Currently Parquet is the 
only data source that " +
--- End diff --

How about ORC? 

cc @dongjoon-hyun Do you know whether it is also doable in the latest ORC 
version?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-20 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r189492534
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/planning/SelectedFieldSuite.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.exceptions.TestFailedException
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.NamedExpression
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.types._
+
+// scalastyle:off line.size.limit
+class SelectedFieldSuite extends SparkFunSuite with BeforeAndAfterAll {
+  // The test schema as a tree string, i.e. `schema.treeString`
+  // root
+  //  |-- col1: string (nullable = false)
+  //  |-- col2: struct (nullable = true)
+  //  ||-- field1: integer (nullable = true)
+  //  ||-- field2: array (nullable = true)
+  //  |||-- element: integer (containsNull = false)
+  //  ||-- field3: array (nullable = false)
+  //  |||-- element: struct (containsNull = true)
+  //  ||||-- subfield1: integer (nullable = true)
+  //  ||||-- subfield2: integer (nullable = true)
+  //  ||||-- subfield3: array (nullable = true)
+  //  |||||-- element: integer (containsNull = true)
+  //  ||-- field4: map (nullable = true)
+  //  |||-- key: string
+  //  |||-- value: struct (valueContainsNull = false)
+  //  ||||-- subfield1: integer (nullable = true)
+  //  ||||-- subfield2: array (nullable = true)
+  //  |||||-- element: integer (containsNull = false)
+  //  ||-- field5: array (nullable = false)
+  //  |||-- element: struct (containsNull = true)
+  //  ||||-- subfield1: struct (nullable = false)
+  //  |||||-- subsubfield1: integer (nullable = true)
+  //  |||||-- subsubfield2: integer (nullable = true)
+  //  ||||-- subfield2: struct (nullable = true)
+  //  |||||-- subsubfield1: struct (nullable = true)
+  //  ||||||-- subsubsubfield1: string (nullable = 
true)
+  //  |||||-- subsubfield2: integer (nullable = true)
+  //  ||-- field6: struct (nullable = true)
+  //  |||-- subfield1: string (nullable = false)
+  //  |||-- subfield2: string (nullable = true)
+  //  ||-- field7: struct (nullable = true)
+  //  |||-- subfield1: struct (nullable = true)
+  //  ||||-- subsubfield1: integer (nullable = true)
+  //  ||||-- subsubfield2: integer (nullable = true)
+  //  ||-- field8: map (nullable = true)
+  //  |||-- key: string
+  //  |||-- value: array (valueContainsNull = false)
+  //  ||||-- element: struct (containsNull = true)
+  //  |||||-- subfield1: integer (nullable = true)
+  //  |||||-- subfield2: array (nullable = true)
+  //  ||||||-- element: integer (containsNull = false)
+  //  ||-- field9: map (nullable = true)
+  //  |||-- key: string
+  //  |||-- value: integer (valueContainsNull = false)
+  //  |-- col3: array (nullable = false)
+  //  ||-- element: struct (containsNull = false)
+  //  |||-- field1: struct (nullable = true)
+  //  ||||-- subfield1: integer (nullable = false)
+  //  ||||-- subfield2: integer (nullable = true)
+  //  |||-- field2: map (nullable = true)
+  //  ||||-- key: string
+  //  ||||-- value: integer

[GitHub] spark issue #21374: [SPARK-24323][SQL] Fix lint-java errors

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21374
  
Also, I don't think we need a JIRA for each lint break. It's "virtually 
same" before after.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21374: [SPARK-24323][SQL] Fix lint-java errors

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21374
  
Yea, so I am saying that we probably do this less frequently. LGTM btw if 
it fixes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90877/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21266
  
**[Test build #90877 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90877/testReport)**
 for PR 21266 at commit 
[`fad31b7`](https://github.com/apache/spark/commit/fad31b7582266f96c8dba1eb83ab73f7aed893f8).
 * This patch **fails to generate documentation**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3406/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21288
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21288
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3405/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...

2018-05-20 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21193
  
@cloud-fan @rednaxelafx Your last comments are addressed. Please check if 
you have other comments. Thanks for review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21266
  
**[Test build #90879 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90879/testReport)**
 for PR 21266 at commit 
[`d8c308f`](https://github.com/apache/spark/commit/d8c308fa43a001328b8645e0d339875342c25c67).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20872: [SPARK-24328][SQL] Fix scala.MatchError in literals.sql....

2018-05-20 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20872
  
@cloud-fan Could you check and resolve the jira? Thanks! 
https://issues.apache.org/jira/browse/SPARK-24328


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3404/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21288
  
**[Test build #90878 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90878/testReport)**
 for PR 21288 at commit 
[`39e5a50`](https://github.com/apache/spark/commit/39e5a507fe22cade6bed0613eefbccab15cf45ff).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90875/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21266
  
**[Test build #90875 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90875/testReport)**
 for PR 21266 at commit 
[`3b6f541`](https://github.com/apache/spark/commit/3b6f541d616d458fa90aa7e70d89d56dd41394f6).
 * This patch **fails to generate documentation**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21266
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3403/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21266
  
**[Test build #90877 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90877/testReport)**
 for PR 21266 at commit 
[`fad31b7`](https://github.com/apache/spark/commit/fad31b7582266f96c8dba1eb83ab73f7aed893f8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-20 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21372
  
You've already checked if we have no performance difference, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20345
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3402/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20345
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...

2018-05-20 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21363#discussion_r189490878
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala
 ---
@@ -59,13 +59,21 @@ class CSVInferSchemaSuite extends SparkFunSuite {
 assert(CSVInferSchema.inferField(IntegerType, textValueOne, options) 
== expectedTypeOne)
   }
 
-  test("Timestamp field types are inferred correctly via custom data 
format") {
-var options = new CSVOptions(Map("timestampFormat" -> "-mm"), 
"GMT")
+  test("Timestamp field types are inferred correctly via custom date 
format") {
+var options = new CSVOptions(Map("timestampFormat" -> "-MM"), 
"GMT")
--- End diff --

Why we need to change this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...

2018-05-20 Thread maropu

Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21288#discussion_r189490682
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
  */
 object FilterPushdownBenchmark {
   val conf = new SparkConf()
-  conf.set("orc.compression", "snappy")
-  conf.set("spark.sql.parquet.compression.codec", "snappy")
+.setMaster("local[1]")
--- End diff --

ok


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21266
  
**[Test build #90875 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90875/testReport)**
 for PR 21266 at commit 
[`3b6f541`](https://github.com/apache/spark/commit/3b6f541d616d458fa90aa7e70d89d56dd41394f6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20345
  
**[Test build #90876 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90876/testReport)**
 for PR 20345 at commit 
[`94d9171`](https://github.com/apache/spark/commit/94d9171b8ec26c21724dd393cf4fc83ff52623e7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21374: [SPARK-24323][SQL] Fix lint-java errors

2018-05-20 Thread kiszk

Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21374
  
Until now, @gatorsmile and @ueshin fixed these when we found. I am neutral 
on the policy.
I would like to hear their opinion.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...

2018-05-20 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21363#discussion_r189489932
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
 ---
@@ -140,14 +141,23 @@ private[csv] object CSVInferSchema {
   private def tryParseDouble(field: String, options: CSVOptions): DataType 
= {
 if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field, 
options)) {
   DoubleType
+} else {
+  tryParseDate(field, options)
--- End diff --

Is this a behavior change? Previously timestamp type, now date 
type/timestamp type?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-05-20 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20345
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21379
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21379
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3401/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...

2018-05-20 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21379
  
@gatorsmile @conorbmurphy


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21379
  
**[Test build #90874 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90874/testReport)**
 for PR 21379 at commit 
[`8d97b0d`](https://github.com/apache/spark/commit/8d97b0deb5ed96094f70f16376b677fe3ff1bdfc).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21379: [SPARK-24327][SQL] Add an option to quote a parti...

2018-05-20 Thread maropu

GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/21379

[SPARK-24327][SQL] Add an option to quote a partition column name in 
JDBCRelation

## What changes were proposed in this pull request?
This pr added a new option to quote a partition column name in 
`JDBCRelation`.

## How was this patch tested?
Added tests in `JDBCSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark SPARK-24327

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21379


commit 8d97b0deb5ed96094f70f16376b677fe3ff1bdfc
Author: Takeshi Yamamuro 
Date:   2018-05-21T03:36:16Z

Fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21331: [SPARK-24276][SQL] Order of literals in IN should...

2018-05-20 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21331#discussion_r189488666
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala
 ---
@@ -85,6 +87,14 @@ object Canonicalize {
 case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r)
 case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
 
+// order the list in the In operator
+// we can do this only if all the elements in the list are literals 
with the same datatype
+case i @ In(value, list)
+if i.inSetConvertible && 
list.map(_.dataType.asNullable).distinct.size == 1 =>
+  val literals = list.map(_.asInstanceOf[Literal])
+  val ordering = 
TypeUtils.getInterpretedOrdering(literals.head.dataType)
--- End diff --

For non-ordering type, this will throw match error.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21331: [SPARK-24276][SQL] Order of literals in IN should...

2018-05-20 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21331#discussion_r189488893
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala
 ---
@@ -85,6 +87,14 @@ object Canonicalize {
 case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r)
 case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
 
+// order the list in the In operator
+// we can do this only if all the elements in the list are literals 
with the same datatype
+case i @ In(value, list)
--- End diff --

Can't we just reorder elements in list by `hashCode`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...

2018-05-20 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21363#discussion_r189487172
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -143,6 +145,12 @@ object DateTimeUtils {
 millisLocal - getOffsetFromLocalMillis(millisLocal, timeZone)
   }
 
+  def dateTimeToMicroseconds(localDateTime: LocalDateTime, timeZone: 
TimeZone): Long = {
+val microOfSecond = localDateTime.getLong(ChronoField.MICRO_OF_SECOND)
+val epochSecond = 
localDateTime.atZone(timeZone.toZoneId).toInstant.getEpochSecond
+epochSecond * 100L + microOfSecond
--- End diff --

`100L` -> `MICROS_PER_SECOND`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21372
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21372
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90869/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21372
  
**[Test build #90869 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90869/testReport)**
 for PR 21372 at commit 
[`700872d`](https://github.com/apache/spark/commit/700872de9f928c288751831315948367d2dc50f6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20795: [SPARK-23486]cache the function name from the ext...

2018-05-20 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20795#discussion_r189485634
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/LookupFunctionsSuite.scala
 ---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.analysis
+
+import java.net.URI
+
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, 
InMemoryCatalog, SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions.Alias
+import org.apache.spark.sql.catalyst.plans.PlanTest
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.internal.SQLConf
+
+class LookupFunctionsSuite extends PlanTest {
+
+  test("SPARK-23486: LookupFunctions should not check the same function 
name more than once") {
+val externalCatalog = new CustomInMemoryCatalog
+val analyzer = {
+  val conf = new SQLConf()
+  val catalog = new SessionCatalog(externalCatalog, 
FunctionRegistry.builtin, conf)
+  catalog.createDatabase(
+CatalogDatabase("default", "", new URI("loc"), Map.empty),
+ignoreIfExists = false)
+  new Analyzer(catalog, conf)
+}
+
+def table(ref: String): LogicalPlan = 
UnresolvedRelation(TableIdentifier(ref))
+val unresolvedFunc = UnresolvedFunction("func", Seq.empty, false)
+val plan = Project(
+  Seq(Alias(unresolvedFunc, "call1")(), Alias(unresolvedFunc, 
"call2")(),
+Alias(unresolvedFunc, "call1")()),
--- End diff --

nit: call3?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21192
  
It's stuck. I kept pinging and gave up.

> Do users need to understand the array option string format and parse it 
themselves?

Yea, I think this is the reason why it's stuck. I believe there's no easier 
other option. It's at least not a custom format but a standard JSON.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21356: [SPARK-24309][CORE] AsyncEventQueue should stop on inter...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21356
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21356: [SPARK-24309][CORE] AsyncEventQueue should stop on inter...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21356
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3400/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread xuanyuanking

Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189483903
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.jupyter.eagerEval.enabled
+  false
+  
+Open eager evaluation on jupyter or not. If yes, dataframe will be ran 
automatically
+and html table will feedback the queries user have defined (see
--- End diff --

Got it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread xuanyuanking

Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189483894
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.jupyter.eagerEval.enabled
+  false
+  
+Open eager evaluation on jupyter or not. If yes, dataframe will be ran 
automatically
--- End diff --

Copy.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21370
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21363#discussion_r189483592
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -119,7 +119,6 @@ class CSVOptions(
   val positiveInf = parameters.getOrElse("positiveInf", "Inf")
   val negativeInf = parameters.getOrElse("negativeInf", "-Inf")
 
-
--- End diff --

Sounds unrelated change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21370
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3399/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21356: [SPARK-24309][CORE] AsyncEventQueue should stop on inter...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21356
  
**[Test build #90873 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90873/testReport)**
 for PR 21356 at commit 
[`09d55af`](https://github.com/apache/spark/commit/09d55afa4167460e732b2f4acb3cdde6029cf952).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21363#discussion_r189483493
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParserSuite.scala
 ---
@@ -107,20 +107,26 @@ class UnivocityParserSuite extends SparkFunSuite {
 assert(parser.makeConverter("_1", BooleanType, options = 
options).apply("true") == true)
 
 val timestampsOptions =
-  new CSVOptions(Map("timestampFormat" -> "dd/MM/ hh:mm"), "GMT")
+  new CSVOptions(Map("timestampFormat" -> "dd/MM/ HH:mm"), "GMT")
 val customTimestamp = "31/01/2015 00:00"
-val expectedTime = 
timestampsOptions.timestampFormat.parse(customTimestamp).getTime
+
+val expectedTime = LocalDateTime.parse(customTimestamp, 
timestampsOptions.timestampFormatter)
+  .atZone(options.timeZone.toZoneId)
+  .toInstant.toEpochMilli
 val castedTimestamp =
-  parser.makeConverter("_1", TimestampType, nullable = true, options = 
timestampsOptions)
+  parser.makeConverter("_1", TimestampType, nullable = true, 
timestampsOptions)
 .apply(customTimestamp)
 assert(castedTimestamp == expectedTime * 1000L)
 
-val customDate = "31/01/2015"
 val dateOptions = new CSVOptions(Map("dateFormat" -> "dd/MM/"), 
"GMT")
-val expectedDate = dateOptions.dateFormat.parse(customDate).getTime
+val customDate = "31/01/2015"
+
+val expectedDate = LocalDate.parse(customDate, 
dateOptions.dateFormatter)
+  .atStartOfDay(options.timeZone.toZoneId)
+  .toInstant.toEpochMilli
 val castedDate =
-  parser.makeConverter("_1", DateType, nullable = true, options = 
dateOptions)
--- End diff --

I would keep this line as was. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21363
  
@sergey-rubtsov, would we be able to add a configuration to control this 
behaviour? Sounds we should better have a configuration to control this 
behaviour for now since the date / timestamp parsing logic is affected.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21356: [SPARK-24309][CORE] AsyncEventQueue should stop o...

2018-05-20 Thread squito

Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/21356#discussion_r189483397
  
--- Diff: core/src/main/scala/org/apache/spark/util/ListenerBus.scala ---
@@ -80,6 +89,11 @@ private[spark] trait ListenerBus[L <: AnyRef, E] extends 
Logging {
   }
   try {
 doPostEvent(listener, event)
+if (Thread.interrupted()) {
--- End diff --

> This is ok right now since Spark code never explicitly interrupts these 
threads. If we ever need to do that, though, this might become a problem... but 
in that case I don't know how you'd handle this issue without just giving up 
and stopping everything.

If spark were to explicitly interrupt, then I think we'd also set some 
other flag indicating a reason, eg. `val requestedQueueStop: AtomicBoolean` so 
it shouldn't be hard to distinguish.

I've pushed an update to handle `InterruptedException` as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21350: [SPARK-24303][PYTHON] Update cloudpickle to v0.4.4

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21350
  
Thank you @felixcheung, @ueshin and @BryanCutler for reviewing this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21370
  
**[Test build #90872 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90872/testReport)**
 for PR 21370 at commit 
[`f2bb8f3`](https://github.com/apache/spark/commit/f2bb8f334631734869ddf5d8ef1eca1fa29d334a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21370
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3398/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21370
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21331: [SPARK-24276][SQL] Order of literals in IN should...

2018-05-20 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21331#discussion_r189483055
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala
 ---
@@ -85,6 +87,14 @@ object Canonicalize {
 case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r)
 case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
 
+// order the list in the In operator
+// we can do this only if all the elements in the list are literals 
with the same datatype
+case i @ In(value, list)
+if i.inSetConvertible && 
list.map(_.dataType.asNullable).distinct.size == 1 =>
+  val literals = list.map(_.asInstanceOf[Literal])
+  val ordering = 
TypeUtils.getInterpretedOrdering(literals.head.dataType)
+  In(value, literals.sortBy(_.value)(ordering))
--- End diff --

Thanks. BTW, it comes from [your 
example](https://github.com/apache/spark/pull/21331/files#r189407673). Anyway, 
my bad.  :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread xuanyuanking

Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/21370
  
```
this will need to escape the values to make sure it is legal html too right?
```
Yes you're right, thanks for your guidance, the new patch consider the 
escape and add new UT.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189483025
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.jupyter.eagerEval.enabled
+  false
+  
+Open eager evaluation on jupyter or not. If yes, dataframe will be ran 
automatically
+and html table will feedback the queries user have defined (see
--- End diff --

HTML


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r189483017
  
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also 
available, and may be useful
 from JVM to Python worker for every task.
   
 
+
+  spark.jupyter.eagerEval.enabled
+  false
+  
+Open eager evaluation on jupyter or not. If yes, dataframe will be ran 
automatically
--- End diff --

Jjupyter


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21370
  
**[Test build #90871 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90871/testReport)**
 for PR 21370 at commit 
[`ebc0b11`](https://github.com/apache/spark/commit/ebc0b11fd006386d32949f56228e2671297373fc).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21342: [SPARK-24294] Throw SparkException when OOM in Br...

2018-05-20 Thread jinxing64

Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21342#discussion_r189482538
  
--- Diff: 
core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryException.java ---
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.memory;
+
+import org.apache.spark.annotation.Private;
+
+/**
+ * SPARK-24294: To bypass scala bug: 
https://github.com/scala/bug/issues/9554, we catch
+ * {@link OutOfMemoryError} in {@link scala.concurrent.Future}'s body, and 
re-throw
+ * SparkOutOfMemoryException instead.
+ */
+@Private
+public final class SparkOutOfMemoryException extends Exception {
+
+  private OutOfMemoryError oe;
--- End diff --

@felixcheung  thanks for review.
In current change there is no SparkOutOfMemoryException. I wrap fatal 
Throwable in SparkFatalException


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21370
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21294: [SPARK-24197][SparkR][SQL] Adding array_sort function to...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21294
  
Sorry I am late. My internet connection was limited. Not sure why too. 
FWIW, AppVeyor tests the latest R version and Jenkins has old R version ... I 
ran the test in my local but wasn't able to reproduce it too.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21374: [SPARK-24323][SQL] Fix lint-java errors

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21374
  
hm, I think we better fix these in a batch or when the release is close ...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21368: [SPARK-16451][repl] Fail shell if SparkSession fa...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21368#discussion_r189480728
  
--- Diff: python/pyspark/shell.py ---
@@ -38,25 +41,29 @@
 SparkContext._ensure_initialized()
 
 try:
-# Try to access HiveConf, it will raise exception if Hive is not added
-conf = SparkConf()
-if conf.get('spark.sql.catalogImplementation', 'hive').lower() == 
'hive':
-SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
-spark = SparkSession.builder\
-.enableHiveSupport()\
-.getOrCreate()
-else:
+try:
+# Try to access HiveConf, it will raise exception if Hive is not 
added
+conf = SparkConf()
+if conf.get('spark.sql.catalogImplementation', 'hive').lower() == 
'hive':
+SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf()
+spark = SparkSession.builder\
+.enableHiveSupport()\
+.getOrCreate()
+else:
+spark = SparkSession.builder.getOrCreate()
+except py4j.protocol.Py4JError:
+if conf.get('spark.sql.catalogImplementation', '').lower() == 
'hive':
+warnings.warn("Fall back to non-hive support because failing 
to access HiveConf, "
+  "please make sure you build spark with hive")
+spark = SparkSession.builder.getOrCreate()
+except TypeError:
+if conf.get('spark.sql.catalogImplementation', '').lower() == 
'hive':
+warnings.warn("Fall back to non-hive support because failing 
to access HiveConf, "
+  "please make sure you build spark with hive")
 spark = SparkSession.builder.getOrCreate()
-except py4j.protocol.Py4JError:
-if conf.get('spark.sql.catalogImplementation', '').lower() == 'hive':
-warnings.warn("Fall back to non-hive support because failing to 
access HiveConf, "
-  "please make sure you build spark with hive")
-spark = SparkSession.builder.getOrCreate()
-except TypeError:
-if conf.get('spark.sql.catalogImplementation', '').lower() == 'hive':
-warnings.warn("Fall back to non-hive support because failing to 
access HiveConf, "
-  "please make sure you build spark with hive")
-spark = SparkSession.builder.getOrCreate()
+except Exception as e:
+print("Failed to initialize Spark session:", e, file=sys.stderr)
--- End diff --

For consistency, it sounds better to print out traceback here too likewise 
with Scala side?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21349: [MINOR][PROJECT-INFRA] Check if 'original_head' v...

2018-05-20 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21349


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21363
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21363
  
**[Test build #90870 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90870/testReport)**
 for PR 21363 at commit 
[`65179a2`](https://github.com/apache/spark/commit/65179a2bdd1623fb7f4077cdc316de5a7436c49d).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...

2018-05-20 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21363
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90870/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21349: [MINOR][PROJECT-INFRA] Check if 'original_head' variable...

2018-05-20 Thread HyukjinKwon

Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21349
  
Thanks for reviewing this @felixcheung. Will make the change to 
spark-webisite too in this week.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...

2018-05-20 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21363
  
**[Test build #90870 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90870/testReport)**
 for PR 21363 at commit 
[`65179a2`](https://github.com/apache/spark/commit/65179a2bdd1623fb7f4077cdc316de5a7436c49d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 4 >

1 - 100 of 306 matches

Mail list logo