[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90871/ Test FAILed.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21370 **[Test build #90871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90871/testReport)** for PR 21370 at commit [`ebc0b11`](https://github.com/apache/spark/commit/ebc0b11fd006386d32949f56228e2671297373fc). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90872/ Test FAILed.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Merged build finished. Test FAILed.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21370 **[Test build #90872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90872/testReport)** for PR 21370 at commit [`f2bb8f3`](https://github.com/apache/spark/commit/f2bb8f334631734869ddf5d8ef1eca1fa29d334a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21316: [SPARK-20538][SQL] Wrap Dataset.reduce with withN...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21316#discussion_r189496880

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -1607,7 +1607,9 @@ class Dataset[T] private[sql](
        */
       @Experimental
       @InterfaceStability.Evolving
    -  def reduce(func: (T, T) => T): T = rdd.reduce(func)
    +  def reduce(func: (T, T) => T): T = withNewExecutionId {
    --- End diff --

    @rxin Sorry, I missed that comment. I just thought what `reduce` can do is what type-safe `groupByKey` can do. But, probably I missed something.
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189493817 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM to Python worker for every task. + + spark.jupyter.eagerEval.enabled + false + +Open eager evaluation on jupyter or not. If yes, dataframe will be ran automatically +and html table will feedback the queries user have defined (see +https://issues.apache.org/jira/browse/SPARK-24215;>SPARK-24215 for more details). + + + + spark.jupyter.default.showRows --- End diff -- `spark.jupyter.eagerEval.showRows` or something? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189493455 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -358,6 +357,43 @@ class Dataset[T] private[sql]( sb.toString() } + /** + * Transform current row string and append to builder + * + * @param row Current row of string + * @param truncate If set to more than 0, truncates strings to `truncate` characters and + *all cells will be aligned right. + * @param colWidths The width of each column + * @param html If set to true, return output as html table. + * @param head Set to true while current row is table head. + * @param sbStringBuilder for current row. + */ + private[sql] def appendRowString( + row: Seq[String], + truncate: Int, + colWidths: Array[Int], + html: Boolean, + head: Boolean, + sb: StringBuilder): Unit = { +val data = row.zipWithIndex.map { case (cell, i) => + if (truncate > 0) { +StringUtils.leftPad(cell, colWidths(i)) + } else { +StringUtils.rightPad(cell, colWidths(i)) + } +} +(html, head) match { + case (true, true) => +data.map(StringEscapeUtils.escapeHtml).addString( + sb, "", "", "") --- End diff -- nit: add `\n`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21370#discussion_r189495783

    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -78,6 +78,12 @@ def __init__(self, jdf, sql_ctx):
             self.is_cached = False
             self._schema = None  # initialized lazily
             self._lazy_rdd = None
    +        self._eager_eval = sql_ctx.getConf(
    +            "spark.jupyter.eagerEval.enabled", "false").lower() == "true"
    +        self._default_console_row = int(sql_ctx.getConf(
    +            "spark.jupyter.default.showRows", u"20"))
    +        self._default_console_truncate = int(sql_ctx.getConf(
    +            "spark.jupyter.default.showRows", u"20"))
    --- End diff --

    `spark.jupyter.default.truncate`?
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189493218 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -292,31 +297,25 @@ class Dataset[T] private[sql]( } // Create SeparateLine - val sep: String = colWidths.map("-" * _).addString(sb, "+", "+", "+\n").toString() + val sep: String = if (html) { +// Initial append table label +sb.append("\n") +"\n" + } else { +colWidths.map("-" * _).addString(sb, "+", "+", "+\n").toString() + } // column names - rows.head.zipWithIndex.map { case (cell, i) => -if (truncate > 0) { - StringUtils.leftPad(cell, colWidths(i)) -} else { - StringUtils.rightPad(cell, colWidths(i)) -} - }.addString(sb, "|", "|", "|\n") - + appendRowString(rows.head, truncate, colWidths, html, true, sb) sb.append(sep) // data - rows.tail.foreach { -_.zipWithIndex.map { case (cell, i) => - if (truncate > 0) { -StringUtils.leftPad(cell.toString, colWidths(i)) - } else { -StringUtils.rightPad(cell.toString, colWidths(i)) - } -}.addString(sb, "|", "|", "|\n") + rows.tail.foreach { row => +appendRowString(row.map(_.toString), truncate, colWidths, html, false, sb) --- End diff -- We don't need `.map(_.toString)`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21370#discussion_r189494064

    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -347,13 +353,18 @@ def show(self, n=20, truncate=True, vertical=False):
              name | Bob
             """
             if isinstance(truncate, bool) and truncate:
    -            print(self._jdf.showString(n, 20, vertical))
    +            print(self._jdf.showString(n, 20, vertical, False))
             else:
    -            print(self._jdf.showString(n, int(truncate), vertical))
    +            print(self._jdf.showString(n, int(truncate), vertical, False))

         def __repr__(self):
             return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))

    +    def _repr_html_(self):
    +        if self._eager_eval:
    +            return self._jdf.showString(
    +                self._default_console_row, self._default_console_truncate, False, True)
    --- End diff --

    What will be shown if `spark.jupyter.eagerEval.enabled` is `False`? Will it fall back to the original automatically?
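For readers following the question above: in the diff, `_repr_html_` has no `else` branch, so it returns `None` when the flag is off. As far as I know, IPython and Jupyter treat a `None` result as "no HTML representation" and display the plain `__repr__` text instead. The sketch below only models the method's return values; `FrameSketch` and its fields are hypothetical stand-ins, not the PySpark class, and the table markup is an assumption made for illustration.

```python
import html


class FrameSketch:
    """Hypothetical stand-in for a DataFrame; not the PySpark class."""

    def __init__(self, columns, eager_eval):
        self.columns = columns
        self.eager_eval = eager_eval

    def __repr__(self):
        return "DataFrame[%s]" % ", ".join(self.columns)

    def _repr_html_(self):
        if not self.eager_eval:
            # The diff falls through here implicitly; returning None
            # signals that no HTML representation is available.
            return None
        cells = "".join("<th>%s</th>" % html.escape(c) for c in self.columns)
        return "<table><tr>%s</tr></table>" % cells


disabled = FrameSketch(["name: string"], eager_eval=False)
enabled = FrameSketch(["name: string"], eager_eval=True)
print(disabled._repr_html_())  # None, so a notebook would show repr(disabled)
print(enabled._repr_html_())
```

Making the `return None` explicit in the real method would answer the question in the code itself, though that is a style choice rather than something the thread decided.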
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21370#discussion_r189495803

    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -78,6 +78,12 @@ def __init__(self, jdf, sql_ctx):
             self.is_cached = False
             self._schema = None  # initialized lazily
             self._lazy_rdd = None
    +        self._eager_eval = sql_ctx.getConf(
    +            "spark.jupyter.eagerEval.enabled", "false").lower() == "true"
    +        self._default_console_row = int(sql_ctx.getConf(
    +            "spark.jupyter.default.showRows", u"20"))
    +        self._default_console_truncate = int(sql_ctx.getConf(
    +            "spark.jupyter.default.showRows", u"20"))
    --- End diff --

    I guess we shouldn't hold these three values but extract as `@property` or refer each time in `_repr_html_`. Otherwise, we'll hit unexpected behavior, e.g.:

    ```python
    df = ...
    spark.conf.set("spark.jupyter.eagerEval.enabled", True)
    df
    ```

    won't show the html.
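The stale-value problem described in this comment can be reproduced without Spark. The following is a minimal sketch under stated assumptions: `FakeConf`, `CachedFrame`, and `LazyFrame` are hypothetical names invented for illustration, and only the config key is taken from the diff.

```python
class FakeConf:
    """Hypothetical stand-in for a session configuration (not a Spark class)."""

    def __init__(self):
        self._values = {}

    def set(self, key, value):
        # Store everything as a string, as a string-typed config store would.
        self._values[key] = str(value)

    def get(self, key, default):
        return self._values.get(key, default)


class CachedFrame:
    """Reads the flag once, at construction time (the pattern under review)."""

    def __init__(self, conf):
        self._eager_eval = conf.get(
            "spark.jupyter.eagerEval.enabled", "false").lower() == "true"


class LazyFrame:
    """Reads the flag on every access, so later changes are visible."""

    def __init__(self, conf):
        self._conf = conf

    @property
    def _eager_eval(self):
        return self._conf.get(
            "spark.jupyter.eagerEval.enabled", "false").lower() == "true"


conf = FakeConf()
cached, lazy = CachedFrame(conf), LazyFrame(conf)
conf.set("spark.jupyter.eagerEval.enabled", True)  # set after the frames exist
print(cached._eager_eval)  # False: stale snapshot taken in __init__
print(lazy._eager_eval)    # True: reflects the new setting
```

The property costs one config lookup per display, which seems a fair price for having the setting take effect on frames that already exist.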
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189496454 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM to Python worker for every task. + + spark.jupyter.eagerEval.enabled + false + +Open eager evaluation on jupyter or not. If yes, dataframe will be ran automatically --- End diff -- `true` instead of `yes`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189493846 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM to Python worker for every task. + + spark.jupyter.eagerEval.enabled + false + +Open eager evaluation on jupyter or not. If yes, dataframe will be ran automatically +and html table will feedback the queries user have defined (see +https://issues.apache.org/jira/browse/SPARK-24215;>SPARK-24215 for more details). + + + + spark.jupyter.default.showRows + 20 + +Default number of rows in jupyter html table. + + + + spark.jupyter.default.truncate --- End diff -- `spark.jupyter.eagerEval.truncate` or something? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189493461 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -358,6 +357,43 @@ class Dataset[T] private[sql]( sb.toString() } + /** + * Transform current row string and append to builder + * + * @param row Current row of string + * @param truncate If set to more than 0, truncates strings to `truncate` characters and + *all cells will be aligned right. + * @param colWidths The width of each column + * @param html If set to true, return output as html table. + * @param head Set to true while current row is table head. + * @param sbStringBuilder for current row. + */ + private[sql] def appendRowString( + row: Seq[String], + truncate: Int, + colWidths: Array[Int], + html: Boolean, + head: Boolean, + sb: StringBuilder): Unit = { +val data = row.zipWithIndex.map { case (cell, i) => + if (truncate > 0) { +StringUtils.leftPad(cell, colWidths(i)) + } else { +StringUtils.rightPad(cell, colWidths(i)) + } +} +(html, head) match { + case (true, true) => +data.map(StringEscapeUtils.escapeHtml).addString( + sb, "", "", "") + case (true, false) => +data.map(StringEscapeUtils.escapeHtml).addString( + sb, "", "", "") --- End diff -- ditto. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
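The two nits on `appendRowString` both ask for a trailing `\n` after each HTML row. The tag strings passed to `addString` were lost in this archive's rendering of the diff, so the sketch below assumes `<tr>`, `<th>`, and `<td>` markup. It is a Python analogue written for illustration, not the Scala method, and `append_row_string` is a hypothetical name.

```python
import html


def append_row_string(row, head, out):
    """Append one escaped HTML table row, ending in a newline, to `out`."""
    tag = "th" if head else "td"  # header cells vs. data cells
    cells = "".join(
        "<%s>%s</%s>" % (tag, html.escape(cell), tag) for cell in row)
    # The trailing "\n" is the reviewer's nit: one row per output line.
    out.append("<tr>%s</tr>\n" % cells)


out = []
append_row_string(["name", "a<b"], True, out)
append_row_string(["Bob", "1"], False, out)
print("".join(out), end="")
```

With the newline, each row lands on its own line in the generated markup, which makes the output easier to read and to compare in tests.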
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r189493854 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression} +import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, ProjectionOverSchema, SelectedField} +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructField, StructType} + +/** + * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a + * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the + * Parquet format. 
In Spark SQL, a root-level Parquet column corresponds to a + * SQL column, and a nested Parquet column corresponds to a [[StructField]]. + */ +private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = +if (SQLConf.get.nestedSchemaPruningEnabled) { + apply0(plan) +} else { + plan +} + + private def apply0(plan: LogicalPlan): LogicalPlan = +plan transformDown { + case op @ PhysicalOperation(projects, filters, + l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, partitionSchema, +dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _, _)) => +val projectionFields = projects.flatMap(getFields) +val filterFields = filters.flatMap(getFields) +val requestedFields = (projectionFields ++ filterFields).distinct + +// If [[requestedFields]] includes a nested field, continue. Otherwise, +// return [[op]] +if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) { + val prunedSchema = requestedFields +.map { case (field, _) => StructType(Array(field)) } +.reduceLeft(_ merge _) + val dataSchemaFieldNames = dataSchema.fieldNames.toSet + val prunedDataSchema = +StructType(prunedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) + + // If the data schema is different from the pruned data schema, continue. Otherwise, + // return [[op]]. We effect this comparison by counting the number of "leaf" fields in + // each schemata, assuming the fields in [[prunedDataSchema]] are a subset of the fields + // in [[dataSchema]]. 
+ if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) { +val prunedParquetRelation = + hadoopFsRelation.copy(dataSchema = prunedDataSchema)(hadoopFsRelation.sparkSession) + +// We need to replace the expression ids of the pruned relation output attributes +// with the expression ids of the original relation output attributes so that +// references to the original relation's output are not broken +val outputIdMap = l.output.map(att => (att.name, att.exprId)).toMap +val prunedRelationOutput = + prunedParquetRelation +.schema +.toAttributes +.map { + case att if outputIdMap.contains(att.name) => +att.withExprId(outputIdMap(att.name)) + case att => att +} +val prunedRelation = + l.copy(relation = prunedParquetRelation, output = prunedRelationOutput) +
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r189493986 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression} +import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, ProjectionOverSchema, SelectedField} +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructField, StructType} + +/** + * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a + * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the + * Parquet format. 
In Spark SQL, a root-level Parquet column corresponds to a + * SQL column, and a nested Parquet column corresponds to a [[StructField]]. + */ +private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = +if (SQLConf.get.nestedSchemaPruningEnabled) { + apply0(plan) +} else { + plan +} + + private def apply0(plan: LogicalPlan): LogicalPlan = +plan transformDown { + case op @ PhysicalOperation(projects, filters, + l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, partitionSchema, +dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _, _)) => +val projectionFields = projects.flatMap(getFields) +val filterFields = filters.flatMap(getFields) +val requestedFields = (projectionFields ++ filterFields).distinct + +// If [[requestedFields]] includes a nested field, continue. Otherwise, +// return [[op]] +if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) { + val prunedSchema = requestedFields +.map { case (field, _) => StructType(Array(field)) } +.reduceLeft(_ merge _) + val dataSchemaFieldNames = dataSchema.fieldNames.toSet + val prunedDataSchema = +StructType(prunedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) + + // If the data schema is different from the pruned data schema, continue. Otherwise, + // return [[op]]. We effect this comparison by counting the number of "leaf" fields in + // each schemata, assuming the fields in [[prunedDataSchema]] are a subset of the fields + // in [[dataSchema]]. 
+ if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) { +val prunedParquetRelation = + hadoopFsRelation.copy(dataSchema = prunedDataSchema)(hadoopFsRelation.sparkSession) + +// We need to replace the expression ids of the pruned relation output attributes +// with the expression ids of the original relation output attributes so that +// references to the original relation's output are not broken +val outputIdMap = l.output.map(att => (att.name, att.exprId)).toMap +val prunedRelationOutput = + prunedParquetRelation +.schema +.toAttributes +.map { + case att if outputIdMap.contains(att.name) => +att.withExprId(outputIdMap(att.name)) + case att => att +} +val prunedRelation = + l.copy(relation = prunedParquetRelation, output = prunedRelationOutput) +
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21320#discussion_r189491559

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
    @@ -879,6 +879,15 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
           }
         }
       }
    +
    +  test("select function over nested data") {
    --- End diff --

    Without this PR, this test can still pass, right? Could you submit a separate PR for these test coverage improvements? We really welcome test coverage improvement PRs.
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r189491063 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala --- @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.types.StructType + +/** + * An optional mix-in for columnar [[FileFormat]]s. This trait provides some helpful metadata when + * debugging a physical query plan. + */ +private[sql] trait ColumnarFileFormat { --- End diff -- Can we do this in a separate PR? No need to block this PR due to the discussion about this implementation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21320#discussion_r189489061

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---
    @@ -162,7 +162,9 @@ case class FilterExec(condition: Expression, child: SparkPlan)
         val generatedIsNotNullChecks = new Array[Boolean](notNullPreds.length)
         val generated = otherPreds.map { c =>
           val nullChecks = c.references.map { r =>
    -        val idx = notNullPreds.indexWhere { n => n.asInstanceOf[IsNotNull].child.semanticEquals(r)}
    +        val idx = notNullPreds.indexWhere { n =>
    +          n.asInstanceOf[IsNotNull].child.references.contains(r)
    --- End diff --

    Is this change related?
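To make the behavioral difference in this hunk concrete, here is a toy model in Python. It is not Catalyst code: `Attr` and `GetField` are hypothetical stand-ins for an attribute and a nested-field extractor, and plain `==` stands in for `semanticEquals`. Under those assumptions, an `IsNotNull` predicate over a nested field `a.b` is not matched by the old equality check against attribute `a`, but is matched by the new membership check on `references`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    """Toy attribute reference, e.g. a top-level column."""
    name: str

    @property
    def references(self):
        return {self}


@dataclass(frozen=True)
class GetField:
    """Toy nested-field extractor, e.g. `child.field`."""
    child: object
    field: str

    @property
    def references(self):
        # A field extractor refers to whatever its child refers to.
        return self.child.references


def index_by_equality(not_null_children, ref):
    """Old check: the IsNotNull child must equal the referenced attribute."""
    for i, child in enumerate(not_null_children):
        if child == ref:
            return i
    return -1


def index_by_reference(not_null_children, ref):
    """New check: the IsNotNull child merely has to reference the attribute."""
    for i, child in enumerate(not_null_children):
        if ref in child.references:
            return i
    return -1


a = Attr("a")
not_null_children = [GetField(a, "b")]  # models IsNotNull(a.b)
print(index_by_equality(not_null_children, a))   # -1
print(index_by_reference(not_null_children, a))  # 0
```

In this model the new check is strictly looser than the old one, which may be why the reviewer asks whether the change belongs in this PR.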
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r189489577 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala --- @@ -99,27 +100,28 @@ trait ConstraintHelper { } /** - * Infer the Attribute-specific IsNotNull constraints from the null intolerant child expressions - * of constraints. + * Infer the Attribute and ExtractValue-specific IsNotNull constraints from the null intolerant + * child expressions of constraints. */ private def inferIsNotNullConstraints(constraint: Expression): Seq[Expression] = constraint match { // When the root is IsNotNull, we can push IsNotNull through the child null intolerant // expressions - case IsNotNull(expr) => scanNullIntolerantAttribute(expr).map(IsNotNull(_)) + case IsNotNull(expr) => scanNullIntolerantField(expr).map(IsNotNull(_)) // Constraints always return true for all the inputs. That means, null will never be returned. // Thus, we can infer `IsNotNull(constraint)`, and also push IsNotNull through the child // null intolerant expressions. - case _ => scanNullIntolerantAttribute(constraint).map(IsNotNull(_)) + case _ => scanNullIntolerantField(constraint).map(IsNotNull(_)) } /** - * Recursively explores the expressions which are null intolerant and returns all attributes - * in these expressions. + * Recursively explores the expressions which are null intolerant and returns all attributes and + * complex type extractors in these expressions. */ - private def scanNullIntolerantAttribute(expr: Expression): Seq[Attribute] = expr match { + private def scanNullIntolerantField(expr: Expression): Seq[Expression] = expr match { +case ev: ExtractValue => Seq(ev) --- End diff -- For this improvement, can we do it in a separate PR? The corresponding unit test case are needed in `InferFiltersFromConstraintsSuite` instead of `ParquetSchemaPruningSuite`. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21320#discussion_r189491217

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
    @@ -286,7 +286,19 @@ case class FileSourceScanExec(
         } getOrElse {
           metadata
         }
    -    withOptPartitionCount
    +    val withOptColumnCount = relation.fileFormat match {
    +      case columnar: ColumnarFileFormat =>
    +        SparkSession
    +          .getActiveSession
    +          .map { sparkSession =>
    +            val columnCount = columnar.columnCountForSchema(sparkSession, requiredSchema)
    +            withOptPartitionCount + ("ColumnCount" -> columnCount.toString)
    --- End diff --

    This needs to be in a separate PR as I suggested above. BTW, we could easily lose this metadata if this change does not have a test case.
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r189479383 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1256,8 +1256,18 @@ object SQLConf { "issues. Turn on this config to insert a local sort before actually doing repartition " + "to generate consistent repartition results. The performance of repartition() may go " + "down since we insert extra local sort before it.") +.booleanConf +.createWithDefault(true) + + val NESTED_SCHEMA_PRUNING_ENABLED = +buildConf("spark.sql.nestedSchemaPruning.enabled") + .internal() + .doc("Prune nested fields from a logical relation's output which are unnecessary in " + +"satisfying a query. This optimization allows columnar file format readers to avoid " + +"reading unnecessary nested column data. Currently Parquet is the only data source that " + --- End diff -- How about ORC? cc @dongjoon-hyun Do you know whether it is also doable in the latest ORC version? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r189492534 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/planning/SelectedFieldSuite.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.exceptions.TestFailedException + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.dsl.plans._ +import org.apache.spark.sql.catalyst.expressions.NamedExpression +import org.apache.spark.sql.catalyst.parser.CatalystSqlParser +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types._ + +// scalastyle:off line.size.limit +class SelectedFieldSuite extends SparkFunSuite with BeforeAndAfterAll { + // The test schema as a tree string, i.e. 
`schema.treeString` + // root + // |-- col1: string (nullable = false) + // |-- col2: struct (nullable = true) + // ||-- field1: integer (nullable = true) + // ||-- field2: array (nullable = true) + // |||-- element: integer (containsNull = false) + // ||-- field3: array (nullable = false) + // |||-- element: struct (containsNull = true) + // ||||-- subfield1: integer (nullable = true) + // ||||-- subfield2: integer (nullable = true) + // ||||-- subfield3: array (nullable = true) + // |||||-- element: integer (containsNull = true) + // ||-- field4: map (nullable = true) + // |||-- key: string + // |||-- value: struct (valueContainsNull = false) + // ||||-- subfield1: integer (nullable = true) + // ||||-- subfield2: array (nullable = true) + // |||||-- element: integer (containsNull = false) + // ||-- field5: array (nullable = false) + // |||-- element: struct (containsNull = true) + // ||||-- subfield1: struct (nullable = false) + // |||||-- subsubfield1: integer (nullable = true) + // |||||-- subsubfield2: integer (nullable = true) + // ||||-- subfield2: struct (nullable = true) + // |||||-- subsubfield1: struct (nullable = true) + // ||||||-- subsubsubfield1: string (nullable = true) + // |||||-- subsubfield2: integer (nullable = true) + // ||-- field6: struct (nullable = true) + // |||-- subfield1: string (nullable = false) + // |||-- subfield2: string (nullable = true) + // ||-- field7: struct (nullable = true) + // |||-- subfield1: struct (nullable = true) + // ||||-- subsubfield1: integer (nullable = true) + // ||||-- subsubfield2: integer (nullable = true) + // ||-- field8: map (nullable = true) + // |||-- key: string + // |||-- value: array (valueContainsNull = false) + // ||||-- element: struct (containsNull = true) + // |||||-- subfield1: integer (nullable = true) + // |||||-- subfield2: array (nullable = true) + // ||||||-- element: integer (containsNull = false) + // ||-- field9: map (nullable = true) + // |||-- key: string + // |||-- value: integer 
(valueContainsNull = false) + // |-- col3: array (nullable = false) + // ||-- element: struct (containsNull = false) + // |||-- field1: struct (nullable = true) + // ||||-- subfield1: integer (nullable = false) + // ||||-- subfield2: integer (nullable = true) + // |||-- field2: map (nullable = true) + // ||||-- key: string + // ||||-- value: integer
[GitHub] spark issue #21374: [SPARK-24323][SQL] Fix lint-java errors
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21374 Also, I don't think we need a JIRA for each lint break. It's "virtually the same" before and after.
[GitHub] spark issue #21374: [SPARK-24323][SQL] Fix lint-java errors
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21374 Yea, so I am saying that we should probably do this less frequently. LGTM btw if it fixes the lint errors.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90877/ Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21266 **[Test build #90877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90877/testReport)** for PR 21266 at commit [`fad31b7`](https://github.com/apache/spark/commit/fad31b7582266f96c8dba1eb83ab73f7aed893f8). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Merged build finished. Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3406/ Test PASSed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Merged build finished. Test PASSed.
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21288 Merged build finished. Test PASSed.
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21288 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3405/ Test PASSed.
[GitHub] spark issue #21193: [SPARK-24121][SQL] Add API for handling expression code ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21193 @cloud-fan @rednaxelafx Your last comments have been addressed. Please check whether you have any other comments. Thanks for the review.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21266 **[Test build #90879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90879/testReport)** for PR 21266 at commit [`d8c308f`](https://github.com/apache/spark/commit/d8c308fa43a001328b8645e0d339875342c25c67).
[GitHub] spark issue #20872: [SPARK-24328][SQL] Fix scala.MatchError in literals.sql....
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20872 @cloud-fan Could you check and resolve the jira? Thanks! https://issues.apache.org/jira/browse/SPARK-24328
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3404/ Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Merged build finished. Test FAILed.
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21288 **[Test build #90878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90878/testReport)** for PR 21288 at commit [`39e5a50`](https://github.com/apache/spark/commit/39e5a507fe22cade6bed0613eefbccab15cf45ff).
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90875/ Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Merged build finished. Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21266 **[Test build #90875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90875/testReport)** for PR 21266 at commit [`3b6f541`](https://github.com/apache/spark/commit/3b6f541d616d458fa90aa7e70d89d56dd41394f6). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Merged build finished. Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3403/ Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21266 **[Test build #90877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90877/testReport)** for PR 21266 at commit [`fad31b7`](https://github.com/apache/spark/commit/fad31b7582266f96c8dba1eb83ab73f7aed893f8).
[GitHub] spark issue #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21372 You've already checked that there is no performance difference, right?
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3402/ Test PASSed.
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Merged build finished. Test PASSed.
[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21363#discussion_r189490878

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala ---
    @@ -59,13 +59,21 @@ class CSVInferSchemaSuite extends SparkFunSuite {
         assert(CSVInferSchema.inferField(IntegerType, textValueOne, options) == expectedTypeOne)
       }
    -  test("Timestamp field types are inferred correctly via custom data format") {
    -    var options = new CSVOptions(Map("timestampFormat" -> "-mm"), "GMT")
    +  test("Timestamp field types are inferred correctly via custom date format") {
    +    var options = new CSVOptions(Map("timestampFormat" -> "-MM"), "GMT")
    --- End diff --

    Why do we need to change this?
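One likely reason for the `mm` to `MM` change in the diff above is that the two pattern letters mean different things in `java.time.format.DateTimeFormatter`: `mm` is minute-of-hour and `MM` is month-of-year. The sketch below illustrates only that difference. The `yyyy` prefix and the `PatternLetters` object are assumptions made for illustration, because the full pattern strings are truncated in this archive.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of the pull request.
object PatternLetters {
  // 31 January 2015, 00:45. Month (1) and minute (45) differ, so the
  // two patterns produce visibly different output.
  val sample: LocalDateTime = LocalDateTime.of(2015, 1, 31, 0, 45)

  // "mm" is minute-of-hour; the "yyyy" prefix is an assumed pattern part.
  def withMinutes: String = sample.format(DateTimeFormatter.ofPattern("yyyy-mm"))

  // "MM" is month-of-year.
  def withMonth: String = sample.format(DateTimeFormatter.ofPattern("yyyy-MM"))
}
```

Here `PatternLetters.withMinutes` formats the minute field where a month was probably intended, which is the kind of mistake a stricter parser surfaces.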
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21288#discussion_r189490682

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
    @@ -32,14 +32,14 @@ import org.apache.spark.util.{Benchmark, Utils}
      */
     object FilterPushdownBenchmark {
       val conf = new SparkConf()
    -  conf.set("orc.compression", "snappy")
    -  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +    .setMaster("local[1]")
    --- End diff --

    ok
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21266 **[Test build #90875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90875/testReport)** for PR 21266 at commit [`3b6f541`](https://github.com/apache/spark/commit/3b6f541d616d458fa90aa7e70d89d56dd41394f6).
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20345 **[Test build #90876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90876/testReport)** for PR 20345 at commit [`94d9171`](https://github.com/apache/spark/commit/94d9171b8ec26c21724dd393cf4fc83ff52623e7).
[GitHub] spark issue #21374: [SPARK-24323][SQL] Fix lint-java errors
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21374 Until now, @gatorsmile and @ueshin have fixed these whenever we found them. I am neutral on the policy. I would like to hear their opinions.
[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21363#discussion_r189489932

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
    @@ -140,14 +141,23 @@ private[csv] object CSVInferSchema {
       private def tryParseDouble(field: String, options: CSVOptions): DataType = {
         if ((allCatch opt field.toDouble).isDefined || isInfOrNan(field, options)) {
           DoubleType
    +    } else {
    +      tryParseDate(field, options)
    --- End diff --

    Is this a behavior change? Previously this inferred a timestamp type; now it infers a date type or a timestamp type?
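To make the question above concrete, here is a minimal sketch of how inserting a date step into a fallback chain can change the inferred type for the same input. It is not Spark's `CSVInferSchema`. The `InferenceOrder` object, the string type names, and the assumption that the timestamp check also accepts date-only values are all hypothetical.

```scala
import java.time.{LocalDate, LocalDateTime}
import scala.util.Try

// Hypothetical model of a type-inference fallback chain.
object InferenceOrder {
  private def isDouble(s: String): Boolean = Try(s.toDouble).isSuccess

  // Strict ISO date, e.g. "2015-01-31".
  private def isDate(s: String): Boolean = Try(LocalDate.parse(s)).isSuccess

  // Assumed lenient timestamp check: ISO date-times and date-only values.
  private def isTimestamp(s: String): Boolean =
    Try(LocalDateTime.parse(s)).isSuccess || isDate(s)

  // Chain without a date step: double, then timestamp, then string.
  def inferOld(s: String): String =
    if (isDouble(s)) "double"
    else if (isTimestamp(s)) "timestamp"
    else "string"

  // Chain with a date step inserted before the timestamp step.
  def inferNew(s: String): String =
    if (isDouble(s)) "double"
    else if (isDate(s)) "date"
    else if (isTimestamp(s)) "timestamp"
    else "string"
}
```

Under these assumptions, a date-only field moves from "timestamp" to "date" while full date-times are unaffected, which is the kind of change a reviewer would want called out or put behind a configuration.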
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20345 retest this please
[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21379 Merged build finished. Test PASSed.
[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21379 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3401/ Test PASSed.
[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21379 @gatorsmile @conorbmurphy
[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21379 **[Test build #90874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90874/testReport)** for PR 21379 at commit [`8d97b0d`](https://github.com/apache/spark/commit/8d97b0deb5ed96094f70f16376b677fe3ff1bdfc).
[GitHub] spark pull request #21379: [SPARK-24327][SQL] Add an option to quote a parti...
GitHub user maropu opened a pull request:

    https://github.com/apache/spark/pull/21379

    [SPARK-24327][SQL] Add an option to quote a partition column name in JDBCRelation

    ## What changes were proposed in this pull request?
    This PR adds a new option to quote a partition column name in `JDBCRelation`.

    ## How was this patch tested?
    Added tests in `JDBCSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-24327

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21379

commit 8d97b0deb5ed96094f70f16376b677fe3ff1bdfc
Author: Takeshi Yamamuro
Date:   2018-05-21T03:36:16Z

    Fix
[GitHub] spark pull request #21331: [SPARK-24276][SQL] Order of literals in IN should...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21331#discussion_r189488666

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala ---
    @@ -85,6 +87,14 @@ object Canonicalize {
         case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r)
         case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
    +    // order the list in the In operator
    +    // we can do this only if all the elements in the list are literals with the same datatype
    +    case i @ In(value, list)
    +        if i.inSetConvertible && list.map(_.dataType.asNullable).distinct.size == 1 =>
    +      val literals = list.map(_.asInstanceOf[Literal])
    +      val ordering = TypeUtils.getInterpretedOrdering(literals.head.dataType)
    --- End diff --

    For a type that has no ordering, this will throw a `MatchError`.
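The failure mode raised above can be sketched in isolation: a lookup implemented as a pattern match with no case for some type throws `scala.MatchError` when that type arrives. The code below is a toy model, not Spark's `TypeUtils`. `OrderingLookup`, its `DataType` hierarchy, and `isOrderable` are hypothetical names used only to show the guard.

```scala
// Hypothetical stand-ins for a type hierarchy and an ordering lookup.
object OrderingLookup {
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType
  case object MapType extends DataType // stands in for a type with no ordering

  // Deliberately non-exhaustive: MapType has no case, so the match
  // throws scala.MatchError for it.
  def orderingFor(dt: DataType): Ordering[Any] = (dt: @unchecked) match {
    case IntType    => Ordering.by((v: Any) => v.asInstanceOf[Int])
    case StringType => Ordering.by((v: Any) => v.asInstanceOf[String])
  }

  def isOrderable(dt: DataType): Boolean = dt != MapType

  // Sort only when an ordering exists; otherwise leave the list untouched.
  def sortIfOrderable(dt: DataType, values: Seq[Any]): Seq[Any] =
    if (isOrderable(dt)) values.sorted(orderingFor(dt)) else values
}
```

Checking orderability before asking for the ordering keeps the rewrite from failing on types it cannot sort; whether that guard belongs in the `case` condition is a design choice for the pull request.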
[GitHub] spark pull request #21331: [SPARK-24276][SQL] Order of literals in IN should...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21331#discussion_r189488893

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala ---
    @@ -85,6 +87,14 @@ object Canonicalize {
         case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r)
         case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
    +    // order the list in the In operator
    +    // we can do this only if all the elements in the list are literals with the same datatype
    +    case i @ In(value, list)
    --- End diff --

    Can't we just reorder the elements in the list by `hashCode`?
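The `hashCode` suggestion above can be tried out on plain Scala collections. This is a sketch of the idea only, with a hypothetical `HashCodeCanonical` helper; it does not use Catalyst expressions.

```scala
// Hypothetical helper illustrating canonical ordering by hashCode.
object HashCodeCanonical {
  // Sort by hashCode so that lists holding the same elements in a
  // different order end up in the same order. No Ordering for A is needed.
  def canonical[A](list: Seq[A]): Seq[A] = list.sortBy(_.hashCode)
}
```

One caveat worth noting: `sortBy` is stable, so elements whose hash codes collide keep their input order. For example, `"Aa"` and `"BB"` share a hash code, so `Seq("Aa", "BB")` and `Seq("BB", "Aa")` stay different after this step. A tie-breaker would be needed if the canonical form must be order-independent in every case.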
[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21363#discussion_r189487172

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala ---
    @@ -143,6 +145,12 @@ object DateTimeUtils {
         millisLocal - getOffsetFromLocalMillis(millisLocal, timeZone)
       }
    +  def dateTimeToMicroseconds(localDateTime: LocalDateTime, timeZone: TimeZone): Long = {
    +    val microOfSecond = localDateTime.getLong(ChronoField.MICRO_OF_SECOND)
    +    val epochSecond = localDateTime.atZone(timeZone.toZoneId).toInstant.getEpochSecond
    +    epochSecond * 100L + microOfSecond
    --- End diff --

    `100L` -> `MICROS_PER_SECOND`?
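A standalone sketch of the conversion under review, using a named constant as the comment suggests. It assumes the intended factor is one million microseconds per second (the literal in the archived diff may have been truncated), defines its own `MICROS_PER_SECOND` rather than using Spark's, and takes a `ZoneId` directly instead of a `TimeZone` to stay self-contained.

```scala
import java.time.{LocalDateTime, ZoneId}
import java.time.temporal.ChronoField

// Hypothetical standalone version, not the code in DateTimeUtils.
object MicrosConversion {
  // Named constant instead of a bare literal; defined locally here.
  val MICROS_PER_SECOND: Long = 1000000L

  def dateTimeToMicroseconds(localDateTime: LocalDateTime, zone: ZoneId): Long = {
    // Microsecond part of the current second (0 to 999999).
    val microOfSecond = localDateTime.getLong(ChronoField.MICRO_OF_SECOND)
    // Whole seconds since the epoch for this wall-clock time in the zone.
    val epochSecond = localDateTime.atZone(zone).toInstant.getEpochSecond
    epochSecond * MICROS_PER_SECOND + microOfSecond
  }
}
```

With a named constant, a mistyped number of zeros shows up as a wrong definition in one place rather than as a silent scaling error at each call site.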
[GitHub] spark issue #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21372 Merged build finished. Test PASSed.
[GitHub] spark issue #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21372 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90869/ Test PASSed.
[GitHub] spark issue #21372: [SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21372 **[Test build #90869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90869/testReport)** for PR 21372 at commit [`700872d`](https://github.com/apache/spark/commit/700872de9f928c288751831315948367d2dc50f6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20795: [SPARK-23486]cache the function name from the ext...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20795#discussion_r189485634 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/LookupFunctionsSuite.scala --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.catalyst.analysis + +import java.net.URI + +import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.catalog.{CatalogDatabase, InMemoryCatalog, SessionCatalog} +import org.apache.spark.sql.catalyst.expressions.Alias +import org.apache.spark.sql.catalyst.plans.PlanTest +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.internal.SQLConf + +class LookupFunctionsSuite extends PlanTest { + + test("SPARK-23486: LookupFunctions should not check the same function name more than once") { +val externalCatalog = new CustomInMemoryCatalog +val analyzer = { + val conf = new SQLConf() + val catalog = new SessionCatalog(externalCatalog, FunctionRegistry.builtin, conf) + catalog.createDatabase( +CatalogDatabase("default", "", new URI("loc"), Map.empty), +ignoreIfExists = false) + new Analyzer(catalog, conf) +} + +def table(ref: String): LogicalPlan = UnresolvedRelation(TableIdentifier(ref)) +val unresolvedFunc = UnresolvedFunction("func", Seq.empty, false) +val plan = Project( + Seq(Alias(unresolvedFunc, "call1")(), Alias(unresolvedFunc, "call2")(), +Alias(unresolvedFunc, "call1")()), --- End diff -- nit: call3? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21192: [SPARK-24118][SQL] Flexible format for the lineSep optio...
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21192

    It's stuck. I kept pinging and gave up.

    > Do users need to understand the array option string format and parse it themselves?

    Yea, I think this is the reason why it's stuck. I believe there's no other, easier option. At least it's not a custom format but standard JSON.
[GitHub] spark issue #21356: [SPARK-24309][CORE] AsyncEventQueue should stop on inter...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21356 Merged build finished. Test PASSed.
[GitHub] spark issue #21356: [SPARK-24309][CORE] AsyncEventQueue should stop on inter...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21356 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3400/ Test PASSed.
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189483903 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM to Python worker for every task. + + spark.jupyter.eagerEval.enabled + false + +Open eager evaluation on jupyter or not. If yes, dataframe will be ran automatically +and html table will feedback the queries user have defined (see --- End diff -- Got it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189483894 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM to Python worker for every task. + + spark.jupyter.eagerEval.enabled + false + +Open eager evaluation on jupyter or not. If yes, dataframe will be ran automatically --- End diff -- Copy. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Merged build finished. Test PASSed.
[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21363#discussion_r189483592

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -119,7 +119,6 @@ class CSVOptions(
       val positiveInf = parameters.getOrElse("positiveInf", "Inf")
       val negativeInf = parameters.getOrElse("negativeInf", "-Inf")
    -
    --- End diff --

    This sounds like an unrelated change.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3399/ Test PASSed.
[GitHub] spark issue #21356: [SPARK-24309][CORE] AsyncEventQueue should stop on inter...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21356 **[Test build #90873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90873/testReport)** for PR 21356 at commit [`09d55af`](https://github.com/apache/spark/commit/09d55afa4167460e732b2f4acb3cdde6029cf952).
[GitHub] spark pull request #21363: [SPARK-19228][SQL] Migrate on Java 8 time from Fa...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21363#discussion_r189483493

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParserSuite.scala ---
    @@ -107,20 +107,26 @@ class UnivocityParserSuite extends SparkFunSuite {
         assert(parser.makeConverter("_1", BooleanType, options = options).apply("true") == true)
         val timestampsOptions =
    -      new CSVOptions(Map("timestampFormat" -> "dd/MM/ hh:mm"), "GMT")
    +      new CSVOptions(Map("timestampFormat" -> "dd/MM/ HH:mm"), "GMT")
         val customTimestamp = "31/01/2015 00:00"
    -    val expectedTime = timestampsOptions.timestampFormat.parse(customTimestamp).getTime
    +
    +    val expectedTime = LocalDateTime.parse(customTimestamp, timestampsOptions.timestampFormatter)
    +      .atZone(options.timeZone.toZoneId)
    +      .toInstant.toEpochMilli
         val castedTimestamp =
    -      parser.makeConverter("_1", TimestampType, nullable = true, options = timestampsOptions)
    +      parser.makeConverter("_1", TimestampType, nullable = true, timestampsOptions)
             .apply(customTimestamp)
         assert(castedTimestamp == expectedTime * 1000L)
    -    val customDate = "31/01/2015"
         val dateOptions = new CSVOptions(Map("dateFormat" -> "dd/MM/"), "GMT")
    -    val expectedDate = dateOptions.dateFormat.parse(customDate).getTime
    +    val customDate = "31/01/2015"
    +
    +    val expectedDate = LocalDate.parse(customDate, dateOptions.dateFormatter)
    +      .atStartOfDay(options.timeZone.toZoneId)
    +      .toInstant.toEpochMilli
         val castedDate =
    -      parser.makeConverter("_1", DateType, nullable = true, options = dateOptions)
    --- End diff --

    I would keep this line as it was.
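The `hh` to `HH` change in the test above matters once parsing goes through `java.time`: `HH` is hour-of-day (0 to 23), while `hh` is clock-hour-of-am-pm and cannot be resolved to a time of day without an AM/PM marker. The sketch below shows that difference on the test's own input. The `yyyy` in both patterns is an assumption, since the year part of the pattern is missing from the archived diff, and `HourPatterns` is a hypothetical helper.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, not part of the pull request.
object HourPatterns {
  // The same input string the test uses.
  val customTimestamp = "31/01/2015 00:00"

  // Parse the sample with the given pattern, capturing any parse failure.
  def parseWith(pattern: String): Try[LocalDateTime] =
    Try(LocalDateTime.parse(customTimestamp, DateTimeFormatter.ofPattern(pattern)))
}
```

Calling `HourPatterns.parseWith("dd/MM/yyyy HH:mm")` succeeds, whereas the `hh` variant fails because the parsed fields do not determine a full local time.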
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21363 @sergey-rubtsov, would we be able to add a configuration to control this behaviour? It sounds like we had better have a configuration to control this behaviour for now, since the date / timestamp parsing logic is affected.
[GitHub] spark pull request #21356: [SPARK-24309][CORE] AsyncEventQueue should stop o...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/21356#discussion_r189483397 --- Diff: core/src/main/scala/org/apache/spark/util/ListenerBus.scala --- @@ -80,6 +89,11 @@ private[spark] trait ListenerBus[L <: AnyRef, E] extends Logging { } try { doPostEvent(listener, event) +if (Thread.interrupted()) { --- End diff -- > This is ok right now since Spark code never explicitly interrupts these threads. If we ever need to do that, though, this might become a problem... but in that case I don't know how you'd handle this issue without just giving up and stopping everything. If Spark were to explicitly interrupt, then I think we'd also set some other flag indicating a reason, e.g. `val requestedQueueStop: AtomicBoolean`, so it shouldn't be hard to distinguish. I've pushed an update to handle `InterruptedException` as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21350: [SPARK-24303][PYTHON] Update cloudpickle to v0.4.4
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21350 Thank you @felixcheung, @ueshin and @BryanCutler for reviewing this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21370 **[Test build #90872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90872/testReport)** for PR 21370 at commit [`f2bb8f3`](https://github.com/apache/spark/commit/f2bb8f334631734869ddf5d8ef1eca1fa29d334a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3398/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21331: [SPARK-24276][SQL] Order of literals in IN should...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/21331#discussion_r189483055 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala --- @@ -85,6 +87,14 @@ object Canonicalize { case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r) case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r) +// order the list in the In operator +// we can do this only if all the elements in the list are literals with the same datatype +case i @ In(value, list) +if i.inSetConvertible && list.map(_.dataType.asNullable).distinct.size == 1 => + val literals = list.map(_.asInstanceOf[Literal]) + val ordering = TypeUtils.getInterpretedOrdering(literals.head.dataType) + In(value, literals.sortBy(_.value)(ordering)) --- End diff -- Thanks. BTW, it comes from [your example](https://github.com/apache/spark/pull/21331/files#r189407673). Anyway, my bad. :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
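For readers following the `Canonicalize` change, the idea in the diff can be sketched outside Catalyst. This is only an illustration in plain Python, not Spark code: `canonicalize_in` is a hypothetical name, and Python types stand in for Catalyst data types. The list is reordered only when every literal has the same type, so that two `IN` predicates differing only in literal order compare equal.

```python
def canonicalize_in(value, literals):
    """Return a canonical ('In', value, literals) tuple.

    Hypothetical stand-in for the Catalyst rule: the literals are sorted
    only when they all share one type, otherwise their order is kept.
    """
    if len({type(x) for x in literals}) == 1:
        literals = sorted(literals)
    return ("In", value, tuple(literals))


# Two predicates that differ only in literal order canonicalize identically.
same = canonicalize_in("a", [2, 1, 3]) == canonicalize_in("a", [3, 2, 1])
```

Mixed-type lists are left untouched in this sketch, which mirrors the guard in the diff that requires a single data type before sorting.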
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/21370 ``` this will need to escape the values to make sure it is legal html too right? ``` Yes, you're right, thanks for your guidance. The new patch handles the escaping and adds a new UT. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
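The escaping concern above can be illustrated with a minimal sketch. This is not the code from the patch: `rows_to_html` is a hypothetical helper, and it assumes the cell values are already collected as Python objects. It relies only on `html.escape` from the standard library.

```python
from html import escape


def rows_to_html(header, rows):
    """Render a header and rows as an HTML table, escaping every cell.

    Hypothetical helper for illustration; escaping each value keeps
    characters such as '<', '>' and '&' from being read as markup.
    """
    parts = ["<table border='1'>"]
    parts.append(
        "<tr>" + "".join("<th>%s</th>" % escape(str(h)) for h in header) + "</tr>"
    )
    for row in rows:
        parts.append(
            "<tr>" + "".join("<td>%s</td>" % escape(str(c)) for c in row) + "</tr>"
        )
    parts.append("</table>")
    return "\n".join(parts)


table = rows_to_html(["a<b"], [["<script>x</script>"], ["1 & 2"]])
```

Without the `escape` calls, a string column containing markup would be rendered by the notebook as HTML rather than displayed as text.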
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189483025 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM to Python worker for every task. + + spark.jupyter.eagerEval.enabled + false + +Open eager evaluation on jupyter or not. If yes, dataframe will be ran automatically +and html table will feedback the queries user have defined (see --- End diff -- HTML --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189483017 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM to Python worker for every task. + + spark.jupyter.eagerEval.enabled + false + +Open eager evaluation on jupyter or not. If yes, dataframe will be ran automatically --- End diff -- Jupyter --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21370 **[Test build #90871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90871/testReport)** for PR 21370 at commit [`ebc0b11`](https://github.com/apache/spark/commit/ebc0b11fd006386d32949f56228e2671297373fc). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21342: [SPARK-24294] Throw SparkException when OOM in Br...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/21342#discussion_r189482538 --- Diff: core/src/main/java/org/apache/spark/memory/SparkOutOfMemoryException.java --- @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.memory; + +import org.apache.spark.annotation.Private; + +/** + * SPARK-24294: To bypass scala bug: https://github.com/scala/bug/issues/9554, we catch + * {@link OutOfMemoryError} in {@link scala.concurrent.Future}'s body, and re-throw + * SparkOutOfMemoryException instead. + */ +@Private +public final class SparkOutOfMemoryException extends Exception { + + private OutOfMemoryError oe; --- End diff -- @felixcheung thanks for the review. In the current change there is no SparkOutOfMemoryException; I wrap the fatal Throwable in SparkFatalException. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21370 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21294: [SPARK-24197][SparkR][SQL] Adding array_sort function to...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21294 Sorry I am late; my internet connection was limited. I'm not sure why either. FWIW, AppVeyor tests the latest R version and Jenkins has an old R version ... I ran the test locally but wasn't able to reproduce it either. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21374: [SPARK-24323][SQL] Fix lint-java errors
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21374 Hm, I think we'd better fix these in a batch or when the release is close ... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21368: [SPARK-16451][repl] Fail shell if SparkSession fa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21368#discussion_r189480728 --- Diff: python/pyspark/shell.py --- @@ -38,25 +41,29 @@ SparkContext._ensure_initialized() try: -# Try to access HiveConf, it will raise exception if Hive is not added -conf = SparkConf() -if conf.get('spark.sql.catalogImplementation', 'hive').lower() == 'hive': -SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf() -spark = SparkSession.builder\ -.enableHiveSupport()\ -.getOrCreate() -else: +try: +# Try to access HiveConf, it will raise exception if Hive is not added +conf = SparkConf() +if conf.get('spark.sql.catalogImplementation', 'hive').lower() == 'hive': +SparkContext._jvm.org.apache.hadoop.hive.conf.HiveConf() +spark = SparkSession.builder\ +.enableHiveSupport()\ +.getOrCreate() +else: +spark = SparkSession.builder.getOrCreate() +except py4j.protocol.Py4JError: +if conf.get('spark.sql.catalogImplementation', '').lower() == 'hive': +warnings.warn("Fall back to non-hive support because failing to access HiveConf, " + "please make sure you build spark with hive") +spark = SparkSession.builder.getOrCreate() +except TypeError: +if conf.get('spark.sql.catalogImplementation', '').lower() == 'hive': +warnings.warn("Fall back to non-hive support because failing to access HiveConf, " + "please make sure you build spark with hive") spark = SparkSession.builder.getOrCreate() -except py4j.protocol.Py4JError: -if conf.get('spark.sql.catalogImplementation', '').lower() == 'hive': -warnings.warn("Fall back to non-hive support because failing to access HiveConf, " - "please make sure you build spark with hive") -spark = SparkSession.builder.getOrCreate() -except TypeError: -if conf.get('spark.sql.catalogImplementation', '').lower() == 'hive': -warnings.warn("Fall back to non-hive support because failing to access HiveConf, " - "please make sure you build spark with hive") -spark = SparkSession.builder.getOrCreate() +except Exception as e: 
+print("Failed to initialize Spark session:", e, file=sys.stderr) --- End diff -- For consistency, it sounds better to print out the traceback here too, as on the Scala side? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
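One way the suggestion above could look is sketched below. It is a standalone illustration, not the code in `python/pyspark/shell.py`: `init_or_exit` and its `create_session` argument are hypothetical names, and exiting with status 1 is an assumption based on the PR title ("Fail shell if SparkSession fa...").

```python
import sys
import traceback


def init_or_exit(create_session):
    """Call create_session(); on failure print the traceback and exit.

    Hypothetical wrapper for illustration. The exit code 1 is an
    assumption, not something confirmed by the diff above.
    """
    try:
        return create_session()
    except Exception:
        print("Failed to initialize Spark session.", file=sys.stderr)
        # Print the full traceback, not just the exception message.
        traceback.print_exc(file=sys.stderr)
        sys.exit(1)
```

Printing the full traceback rather than only `e` shows where the initialization failed, which is the consistency with the Scala side that the comment asks about.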
[GitHub] spark pull request #21349: [MINOR][PROJECT-INFRA] Check if 'original_head' v...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21349 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21363 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21363 **[Test build #90870 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90870/testReport)** for PR 21363 at commit [`65179a2`](https://github.com/apache/spark/commit/65179a2bdd1623fb7f4077cdc316de5a7436c49d). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21363 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90870/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21349: [MINOR][PROJECT-INFRA] Check if 'original_head' variable...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21349 Thanks for reviewing this @felixcheung. I will make the change to spark-website too this week. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21363 **[Test build #90870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90870/testReport)** for PR 21363 at commit [`65179a2`](https://github.com/apache/spark/commit/65179a2bdd1623fb7f4077cdc316de5a7436c49d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org