[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/23152#discussion_r237719791 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) { } def hasCountStats(a: Attribute): Boolean = -get(a).map(_.hasCountStats).getOrElse(false) +get(a).exists(_.hasCountStats) def hasDistinctCount(a: Attribute): Boolean = -get(a).map(_.distinctCount.isDefined).getOrElse(false) +get(a).exists(_.distinctCount.isDefined) def hasMinMaxStats(a: Attribute): Boolean = -get(a).map(_.hasCountStats).getOrElse(false) +get(a).exists(_.hasMinMaxStats) --- End diff -- @viirya This issue is hard to trigger because usually the count, distinct count, and min/max statistics appear together, and therefore `hasMinMaxStats` always return the right result. A numeric column with all null values is sufficient to reproduce this since we have the count and distinct count statistics but not min/max statistics in this case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21879: [SPARK-24927][BUILD][BRANCH-2.3] The scope of snappy-jav...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/21879 @cloud-fan Didn't try to actually reproduce this issue in branches other than branch-2.3, but just by checking the POM files, this issue existed ever since at least 1.6. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21879: [SPARK-24927][BUILD][BRANCH-2.3] The scope of snappy-jav...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/21879 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21879: [SPARK-24927][BUILD][BRANCH-2.3] The scope of snappy-jav...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/21879 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21879: [SPARK-24927][BUILD][BRANCH-2.3] The scope of snappy-jav...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/21879 The Kafka DStream test failures seem to be flaky and irrelevant. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21879: [SPARK-24927] The scope of snappy-java cannot be ...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/21879 [SPARK-24927] The scope of snappy-java cannot be "provided" Please see [SPARK-24927][1] for more details. [1]: https://issues.apache.org/jira/browse/SPARK-24927 Manually tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-24927 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21879.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21879 commit 93c34da713136eb7b4ed8bb8775353c8219efa22 Author: Cheng Lian Date: 2018-07-26T07:01:53Z The scope of snappy-java cannot be "provided" --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21865: [SPARK-24895] Remove spotbugs plugin
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/21865 test this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21865: [SPARK-24895] Remove spotbugs plugin
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/21865 add to whitelist --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/20174 LGTM, merging to master and branch-2.3. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160770992 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala --- @@ -666,4 +665,16 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext { assert(exchangePlans.length == 1) } } + + Seq(true, false).foreach { codegen => +test("SPARK-22951: dropDuplicates on empty data frames should produce correct aggregate" + + s" results when codegen enabled: $codegen") { + withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, codegen.toString)) { +assert(Seq.empty[Int].toDF("a").count() == 0) +assert(Seq.empty[Int].toDF("a").agg(count("*")).count() == 1) +assert(spark.emptyDataFrame.dropDuplicates().count() == 0) + assert(spark.emptyDataFrame.dropDuplicates().agg(count("*")).count() == 1) --- End diff -- @liufengdb Maybe also add assertions to confirm that explicit global aggregations (by providing zero grouping keys) still return one row? For example: ```scala val emptyAgg = Map.empty[String, String] checkAnswer( spark.emptyDataFrame.agg(emptyAgg), Seq(Row()) ) checkAnswer( spark.emptyDataFrame.groupBy().agg(emptyAgg), Seq(Row()) ) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/20174 @mgaido91 We can't because we do not know whether there are any input rows or not. For example: ```scala val df1 = spark.range(10).select() val df2 = spark.range(10).filter($"id" < 0).select() val df3 = df1.dropDuplicates() val df4 = df2.dropDuplicates() ``` `df1` has zero columns and ten rows while `df2` has no columns and zero rows. Therefore, `df3` should return one row containing zero columns while `df4` should return zero rows. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160612592 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] { Alias(new First(attr).toAggregateExpression(), attr.name)(attr.exprId) } } - Aggregate(keys, aggCols, child) + // SPARK-22951: the implementation of aggregate operator treats the cases with and without + // grouping keys differently, when there are not input rows. For the aggregation after + // `dropDuplicates()` on an empty data frame, a grouping key is added here to make sure the + // aggregate operator can work correctly (returning an empty iterator). --- End diff -- > SPARK-22951: Physical aggregate operators distinguishes global aggregation and grouping aggregations by checking the number of grouping keys. The key difference here is that a global aggregation always returns at least one row even if there are no input rows. Here we append a literal when the grouping key list is empty so that the result aggregate operator is properly treated as a grouping aggregation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160611832 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] { Alias(new First(attr).toAggregateExpression(), attr.name)(attr.exprId) } } - Aggregate(keys, aggCols, child) + // SPARK-22951: the implementation of aggregate operator treats the cases with and without + // grouping keys differently, when there are not input rows. For the aggregation after + // `dropDuplicates()` on an empty data frame, a grouping key is added here to make sure the + // aggregate operator can work correctly (returning an empty iterator). + val newKeys = if (keys.isEmpty) Literal(1) :: Nil else keys --- End diff -- Nit: Maybe rename `newKeys` to `nonemptyKeys`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/20174#discussion_r160602821 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala --- @@ -19,6 +19,8 @@ package org.apache.spark.sql import scala.util.Random +import org.apache.spark.sql.catalyst.expressions.Literal +import org.apache.spark.sql.catalyst.expressions.aggregate.Percentile --- End diff -- Remove unused imports? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/20174 @liufengdb I wrote a summary according to our offline discussion to explain the subtle change made in this PR. Please feel free to use it in the PR description if it looks good to you :) Same as all the other SQL query engines, Spark SQL supports both grouping aggregation, e.g.: ```sql SELECT count(*) FROM a_table GROUP BY key ``` and global aggregation, e.g.: ```sql SELECT count(*) FROM a_table ``` Essentially, global aggregation is a special case of grouping aggregation where the whole table is treated as a single group. A key difference, though, exists in the case where a group contains zero rows: - Grouping aggregation ```sql SELECT count(*) AS c FROM range(3) WHERE id < 0 GROUP BY id -- +---+ -- | c | -- +---+ -- +---+ ``` The above query returns zero rows. - Global aggregation ```sql SELECT count(*) AS c FROM range(3) WHERE id < 0 -- +---+ -- | c | -- +---+ -- | 0 | -- +---+ ``` The above query returns one row. To be more specific, global aggregation with zero input rows always return a single row with the initial aggregation state as the output. To tell whether an `Aggregate` operator `A` is a global aggregation or not, Spark SQL simply checks the number of grouping keys, and `A` is a global aggregation if it has zero grouping keys. However, this simple priciple drops the ball in the following case: ```scala spark.emptyDataFrame.dropDuplicates().agg(count($"*") as "c").show() // +---+ // | c | // +---+ // | 1 | // +---+ ``` The reason is that: 1. `df.dropDuplicates()` is roughly translated into something equivalent to: ``` val allColumns = df.columns.map { col } df.groupBy(allColumns: _*).agg(allColumns.head, allColumns.tail: _*) ``` This translation is implemented in the rule `ReplaceDeduplicateWithAggregate`. 2. `spark.emptyDataFrame` contains zero columns and zero rows. Therefore, rule `ReplaceDeduplicateWithAggregate` translates `spark.emptyDataFrame.dropDuplicates()` into something equivalent to: ```scala spark.emptyDataFrame.groupBy().agg(Map.empty[String, String]) ``` Which confuses Spark SQL and gets recognized as a global aggregation because the aggregate operator contains no grouping keys. As a result, Spark SQL allocates a single row filled by the initial aggregation state and uses it as the output, and returns a wrong result. To fix this issue, this PR tweaks `ReplaceDeduplicateWithAggregate` by appending a literal `1` to the grouping key list of the resulting `Aggregate` operator when the input plan contains zero output columns. In this way, `spark.emptyDataFrame.dropDuplicates()` is now translated into: ```scala spark.emptyDataFrame.dropDuplicates() => spark.emptyDataFrame.groupBy(lit(1)).agg(Map.empty[String, String]) ``` Which is now properly treated as a grouping aggregation and returns the correct answer. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/19439 @jkbradley I'm not confident enough about this part but a quick check suggested that typically `PathFilter`s are used in `FileInputFormat.listStatus()`, which is usually called in `FileInputFormat.getSplits()` method, and `getSplits()` is used by Spark to determine RDD partitions on the driver side. That said, in this specific typical scenario, the behavior of `SamplePathFilter` should be deterministic. However, I'd say this assumption is fragile since `PathFilter`s are used in a pretty ad-hoc way throughout the whole Hadoop ecosystem and my impression is that `PathFilter`s themselves are expected to be deterministic in general. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19386: [SPARK-22161] [SQL] Add Impala-modified TPC-DS queries
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/19386 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19361: [SPARK-22140] Add TPCDSQuerySuite
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/19361#discussion_r141501351 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala --- @@ -0,0 +1,348 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.scalatest.BeforeAndAfterAll + +import org.apache.spark.sql.catalyst.util.resourceToString +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.SharedSQLContext + +class TPCDSQuerySuite extends QueryTest with SharedSQLContext with BeforeAndAfterAll { + + /** + * Drop all the tables + */ + protected override def afterAll(): Unit = { +try { + spark.sessionState.catalog.reset() +} finally { + super.afterAll() +} + } + + override def beforeAll() { +super.beforeAll() +sql( + """ +|CREATE TABLE `catalog_page` ( +|`cp_catalog_page_sk` INT, `cp_catalog_page_id` STRING, `cp_start_date_sk` INT, +|`cp_end_date_sk` INT, `cp_department` STRING, `cp_catalog_number` INT, +|`cp_catalog_page_number` INT, `cp_description` STRING, `cp_type` STRING) +|USING parquet + """.stripMargin) + +sql( + """ +|CREATE TABLE `catalog_returns` ( +|`cr_returned_date_sk` INT, `cr_returned_time_sk` INT, `cr_item_sk` INT, +|`cr_refunded_customer_sk` INT, `cr_refunded_cdemo_sk` INT, `cr_refunded_hdemo_sk` INT, +|`cr_refunded_addr_sk` INT, `cr_returning_customer_sk` INT, `cr_returning_cdemo_sk` INT, +|`cr_returning_hdemo_sk` INT, `cr_returning_addr_sk` INT, `cr_call_center_sk` INT, +|`cr_catalog_page_sk` INT, `cr_ship_mode_sk` INT, `cr_warehouse_sk` INT, `cr_reason_sk` INT, +|`cr_order_number` INT, `cr_return_quantity` INT, `cr_return_amount` DECIMAL(7,2), +|`cr_return_tax` DECIMAL(7,2), `cr_return_amt_inc_tax` DECIMAL(7,2), `cr_fee` DECIMAL(7,2), +|`cr_return_ship_cost` DECIMAL(7,2), `cr_refunded_cash` DECIMAL(7,2), +|`cr_reversed_charge` DECIMAL(7,2), `cr_store_credit` DECIMAL(7,2), +|`cr_net_loss` DECIMAL(7,2)) +|USING parquet + """.stripMargin) + +sql( + """ +|CREATE TABLE `customer` ( +|`c_customer_sk` INT, `c_customer_id` STRING, `c_current_cdemo_sk` INT, +|`c_current_hdemo_sk` INT, `c_current_addr_sk` INT, `c_first_shipto_date_sk` INT, +|`c_first_sales_date_sk` INT, `c_salutation` STRING, `c_first_name` STRING, +|`c_last_name` STRING, `c_preferred_cust_flag` STRING, `c_birth_day` INT, +|`c_birth_month` INT, `c_birth_year` INT, `c_birth_country` STRING, `c_login` STRING, +|`c_email_address` STRING, `c_last_review_date` STRING) +|USING parquet + """.stripMargin) + +sql( + """ +|CREATE TABLE `customer_address` ( +|`ca_address_sk` INT, `ca_address_id` STRING, `ca_street_number` STRING, +|`ca_street_name` STRING, `ca_street_type` STRING, `ca_suite_number` STRING, +|`ca_city` STRING, `ca_county` STRING, `ca_state` STRING, `ca_zip` STRING, +|`ca_country` STRING, `ca_gmt_offset` DECIMAL(5,2), `ca_location_type` STRING) +|USING parquet + """.stripMargin) + +sql( + """ +|CREATE TABLE `customer_demographics` ( +|`cd_demo_sk` INT, `cd_gender` STRING, `cd_marital_status` STRING, +|`cd_education_status` STRING, `cd_purchase_estimate` INT, `cd_credit_rating` STRING, +|`cd_dep_count` INT, `cd_dep_employed_count` INT, `cd_dep_college_count` INT) +|USING parquet + """.stripMargin) + +sq
[GitHub] spark pull request #19080: [SPARK-21865][SQL] simplify the distribution sema...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/19080#discussion_r136243745 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala --- @@ -30,18 +30,43 @@ import org.apache.spark.sql.types.{DataType, IntegerType} * - Intra-partition ordering of data: In this case the distribution describes guarantees made *about how tuples are distributed within a single partition. */ -sealed trait Distribution +sealed trait Distribution { + /** + * The required number of partitions for this distribution. If it's None, then any number of + * partitions is allowed for this distribution. + */ + def requiredNumPartitions: Option[Int] + + /** + * Creates a default partitioning for this distribution, which can satisfy this distribution while + * matching the given number of partitions. + */ + def createPartitioning(numPartitions: Int): Partitioning +} /** * Represents a distribution where no promises are made about co-location of data. */ -case object UnspecifiedDistribution extends Distribution +case object UnspecifiedDistribution extends Distribution { + override def requiredNumPartitions: Option[Int] = None + + override def createPartitioning(numPartitions: Int): Partitioning = { +throw new IllegalStateException("UnspecifiedDistribution does not have default partitioning.") + } +} /** * Represents a distribution that only has a single partition and all tuples of the dataset * are co-located. */ -case object AllTuples extends Distribution +case object AllTuples extends Distribution { --- End diff -- Yea, I did miss the point that `ClusteredDistribution` covers both `RangePartitioning` and `HashPartitioning`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19080: [SPARK-21865][SQL] simplify the distribution sema...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/19080#discussion_r136211685 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala --- @@ -30,18 +30,43 @@ import org.apache.spark.sql.types.{DataType, IntegerType} * - Intra-partition ordering of data: In this case the distribution describes guarantees made *about how tuples are distributed within a single partition. */ -sealed trait Distribution +sealed trait Distribution { + /** + * The required number of partitions for this distribution. If it's None, then any number of + * partitions is allowed for this distribution. + */ + def requiredNumPartitions: Option[Int] + + /** + * Creates a default partitioning for this distribution, which can satisfy this distribution while + * matching the given number of partitions. + */ + def createPartitioning(numPartitions: Int): Partitioning +} /** * Represents a distribution where no promises are made about co-location of data. */ -case object UnspecifiedDistribution extends Distribution +case object UnspecifiedDistribution extends Distribution { + override def requiredNumPartitions: Option[Int] = None + + override def createPartitioning(numPartitions: Int): Partitioning = { +throw new IllegalStateException("UnspecifiedDistribution does not have default partitioning.") + } +} /** * Represents a distribution that only has a single partition and all tuples of the dataset * are co-located. */ -case object AllTuples extends Distribution +case object AllTuples extends Distribution { --- End diff -- I think the concern about logical/physical separation is only useful if we take CBO into consideration. E.g., for a `AllTuples` distribution requirement, the planner may produce two plans using `SinglePartition` and `BroadcastPartitioning` respectively and pick a cheaper one. In the scope of our current planner framework, this separation doesn't seem to be very useful, though. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18712: [SPARK-17528][SQL][followup] remove unnecessary data cop...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/18712 Nice, didn't know that the copy issue has already been fixed. LGTM, merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17419 @WeichenXu123 and I did some profiling using `jvisualvm` and found that 40% of the time is spent in the copy performed by [this `safeProjection`][1]. This is a known issue used to fight against the false sharing issue @cloud-fan and I hit before. @cloud-fan tried to fix this issue in #15082 but that PR didn't work out due to some other concerns (I can't remember all the details now). @cloud-fan, any ideas about improving `ObjectHashAggregateExec` (e.g. adding code generation support)? [1]: https://github.com/apache/spark/blob/03367d7aa3acd3abbcacba57ea75c8efa2a9a794/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L154 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18470: [SPARK-21258][SQL] Fix WindowExec complex object aggrega...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/18470 LGTM pending Jenkins. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18412: [SPARK-21203] [SQL] Fix wrong results of insertio...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/18412#discussion_r124062223 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -482,15 +482,15 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String case (fromField, toField) => cast(fromField.dataType, toField.dataType) } // TODO: Could be faster? -val newRow = new GenericInternalRow(from.fields.length) buildCast[InternalRow](_, row => { + val newRow = new GenericInternalRow(from.fields.length) var i = 0 while (i < row.numFields) { newRow.update(i, if (row.isNullAt(i)) null else castFuncs(i)(row.get(i, from.apply(i).dataType))) i += 1 } - newRow.copy() --- End diff -- @hvanhovell I found the semantics of the original `GenericMutableRow.copy()` pretty dangerous since it didn't properly implement deep copy semantics. In fact, it's impossible to do a proper deep copy there. We only copy the underlying field array but may still share field objects accidentally. If we do want to "fix" `GenericInternalRow.copy()`, I'd prefer throwing an exception instead of following the old `GenericMutableRow.copy()` method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to ...
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/18181 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/18181 @rdblue @mallman Thanks for the comments! As mentioned in the JIRA ticket, we've decided to preserve parquet-mr 1.8.2. Instead, we'll add a release notes entry to suggest using parquet-avro 1.8.1 instead of 1.8.2 to avoid the Avro dependency conflict. I'm closing this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/18181 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/18181 Unfortunately, rolling back parquet-mr to 1.8.1 brings back [PARQUET-389][1], which breaks multiple test cases involving schema evolution (add a new column to a Parquet table and filter on that column). Trying to figure out a workaround for this but haven't got any luck yet. [1]: https://issues.apache.org/jira/browse/PARQUET-389 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/18181 @viirya Thanks for reminding! I'm reverting that one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/18181 @dongjoon-hyun I already reverted PR #16751 manually but forgot to mention it in the PR description. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to ...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/18181 [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1 ## What changes were proposed in this pull request? This PR reverts PR #16791, #16817, and part of #16795 to roll back parquet-mr 1.8.2 to 1.8.1 to escape from a dependency hell caused by avro 1.7.7 and 1.8.1. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark rollback-parquet-mr-1.8.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18181.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18181 commit d03d4de8744e8a0eb4b6002b3f4a4e98be35eab6 Author: Cheng Lian <l...@databricks.com> Date: 2017-06-02T00:08:38Z Roll back parquet-mr 1.8.2 to 1.8.1 commit 6dcef66617d95fdffdff71d20a228863d88f97cf Author: Cheng Lian <l...@databricks.com> Date: 2017-06-02T00:10:59Z Update dependency manifests commit c15760bb49e59a0e71971e3a37cc03ffb5ffc60a Author: Cheng Lian <l...@databricks.com> Date: 2017-06-02T00:15:15Z Revert "[SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/hacks due to bugs of old Parquet versions" This reverts commit 7730426cb95eec2652a9ea979ae2c4faf7e585f2. commit cf461d08ea9dda4a3220a5aecf1c22d66d432d99 Author: Cheng Lian <l...@databricks.com> Date: 2017-06-02T00:30:29Z Revert "[SPARK-17213][SQL][FOLLOWUP] Re-enable Parquet filter tests for binary and string" This reverts commit 0f16ff5b0ec8cd828774ba5ddb276d7b06dbe273. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17469: [SPARK-20132][Docs] Add documentation for column string ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17469 Also cherry-picked this to branch-2.2. cc @gatorsmile --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17693: [SPARK-20314][SQL] Inconsistent error handling in JSON p...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17693 Not suggesting doing it in this PR but maybe adding a SQL option to let the users choose the error handling strategy of all the JSON functions probably makes more sense here? The Spark JSON data source allows users to choose a parsing mode among: - `PERMISSIVE`: replacing malformed records with nulls, - `DROPMALFORMED`: drop malformed records, and - `FAILFAST`: report an error and abort once a malformed record is found. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17398: [SPARK-19716][SQL] support by-name resolution for struct...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17398 LGTM. Merging to master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17454: [SPARK-20125][SQL] Dataset of type option of map does no...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17454 OK, resolved the conflict manually and merged to branch-2.1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17454: [SPARK-20125][SQL] Dataset of type option of map does no...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17454 This PR conflicts with branch-2.1, trying to resolve manually while merging. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17454: [SPARK-20125][SQL] Dataset of type option of map does no...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17454 LGTM, merging to master and branch-2.1. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16917: [SPARK-19529][BRANCH-1.6] Backport PR #16866 to b...
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/16917 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17247: [SPARK-19905][SQL] Bring back Dataset.inputFiles ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/17247#discussion_r105489225 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1865,4 +1865,12 @@ class HiveDDLSuite } } } + + test("SPARK-19905: Hive SerDe table input paths") { +withTable("spark_19905") { + spark.range(10).createOrReplaceTempView("spark_19905_view") + sql("CREATE TABLE spark_19905 STORED AS RCFILE AS SELECT * FROM spark_19905_view") + assert(spark.table("spark_19905").inputFiles.nonEmpty) --- End diff -- `input_file_name` and `Dataset.inputFiles` are different code paths. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17247: [SPARK-19905][SQL] Bring back Dataset.inputFiles ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/17247#discussion_r105475257 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -2734,6 +2735,8 @@ class Dataset[T] private[sql]( fsBasedRelation.inputFiles case fr: FileRelation => fr.inputFiles + case r: CatalogRelation if DDLUtils.isHiveTable(r.tableMeta) => +r.tableMeta.storage.locationUri.map { _.toString }.toArray --- End diff -- Didn't use `r.tableMeta.location` here intentionally for safty. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17247: [SPARK-19905][SQL] Bring back Dataset.inputFiles for Hiv...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17247 cc @cloud-fan --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17247: [SPARK-19905][SQL] Bring back Dataset.inputFiles ...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/17247 [SPARK-19905][SQL] Bring back Dataset.inputFiles for Hive SerDe tables ## What changes were proposed in this pull request? `Dataset.inputFiles` works by matching `FileRelation`s in the query plan. In Spark 2.1, Hive SerDe tables are represented by `MetastoreRelation`, which inherits from `FileRelation`. However, in Spark 2.2, Hive SerDe tables are now represented by `CatalogRelation`, which doesn't inherit from `FileRelation` anymore, due to the unification of Hive SerDe tables and data source tables. This change breaks `Dataset.inputFiles` for Hive SerDe tables. This PR tries to fix this issue by explicitly matching `CatalogRelation`s that are Hive SerDe tables in `Dataset.inputFiles`. Note that we can't make `CatalogRelation` inherit from `FileRelation` since not all `CatalogRelation`s are file based (e.g., JDBC data source tables). ## How was this patch tested? New test case added in `HiveDDLSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-19905-hive-table-input-files Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17247.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17247 commit 3e0abc48de8219e3daf839584ec874855ced1210 Author: Cheng Lian <l...@databricks.com> Date: 2017-03-10T19:42:17Z Bring back Dataset.inputFiles for Hive SerDe tables --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17168: [SPARK-19737][SQL] New analysis rule for reporting unreg...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17168 Thanks for the review! Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17168: [SPARK-19737][SQL] New analysis rule for reporting unreg...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17168 @cloud-fan I'm afraid that's even more complicated an approach since you'll need to make function builders, which are now simply regular Scala functions, into unresolved expressions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17168: [SPARK-19737][SQL] New analysis rule for reporting unreg...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/17168 cc @cloud-fan @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17168: [SPARK-19737][SQL] New analysis rule for reportin...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/17168 [SPARK-19737][SQL] New analysis rule for reporting unregistered functions without relying on relation resolution ## What changes were proposed in this pull request? This PR adds a new `Once` analysis rule batch consists of a single analysis rule `LookupFunctions` that performs simple existence check over `UnresolvedFunctions` without actually resolving them. The benefit of this rule is that it doesn't require function arguments to be resolved first and therefore doesn't rely on relation resolution, which may incur potentially expensive partition/schema discovery cost. Please refer to [SPARK-19737][1] for more details about the motivation. ## How was this patch tested? New test case added in `AnalysisErrorSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-19737-lookup-functions Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17168.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17168 commit c9da94ce0d15333f30653f67c6a1af6bd0f10589 Author: Cheng Lian <l...@databricks.com> Date: 2017-03-05T06:16:26Z New analysis rule for detect undefined functions more quickly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16935 Merging to master an branch-2.1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16917: [SPARK-19529][BRANCH-1.6] Backport PR #16866 to branch-1...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16917 cc @JoshRosen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16866: [SPARK-19529] TransportClientFactory.createClient() shou...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16866 Merged to master, branch-2.1, and branch-2.0. Files involved in branch-1.6 were moved to new directories and made it hard to cherry-pick. Created PR #16917 to backport this one to 1.6. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16917: [SPARK-19529][BRANCH-1.6] Backport PR #16866 to b...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/16917 [SPARK-19529][BRANCH-1.6] Backport PR #16866 to branch-1.6 ## What changes were proposed in this pull request? This PR backports PR #16866 to branch-1.6 ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-19529-1.6-backport Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16917.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16917 commit f7d7c54a6b0316d66dd03b3d712df7d33f7cf5b9 Author: Cheng Lian <l...@databricks.com> Date: 2017-02-13T22:12:42Z Backport PR #16866 to branch-1.6 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16866: [SPARK-19529] TransportClientFactory.createClient() shou...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16866 Branch-2.1 test compilation happens to be broken right now. Trying to fix the compilation failure first. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16866: [SPARK-19529] TransportClientFactory.createClient() shou...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16866 Merging to master, branch-2.1, branch-2.0, and branch-1.6. Conflicts occurred, I'm fixing them manually. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16161 Thanks. Backported to branch-2.1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16161 Shall we backport this to branch-2.1? I'd consider this as a bug because of the following snippet fail in Spark 2.1: ```scala case class Wrapper1(value: Option[Map[String, String]]) case class Wrapper2(value: Map[String, String]) val ds1 = Seq.empty[Wrapper1].toDS() // Fail val ds2 = Seq.empty[Wrapper2].toDS() // Succeed ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16795: [SPARK-19409][BUILD] Fix ParquetAvroCompatibilitySuite f...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16795 @dongjoon-hyun, you may add `[TEST-MAVEN]` in the PR title to ask Jenkins to test this PR using Maven. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16791 @HyukjinKwon Sorry that I didn't see your comment before this PR got merged. I believe PARQUET-686 had already been fixed by apache/parquet-mr#367 but wasn't marked as resolved in JIRA. Thanks for sending out #16817 for re-enabling the tests! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16791 I see, thanks for the context. But I'd like to keep this Maven build failure fix in a separate PR so that people can easily cherry-pick the fix. Also, it helps to keep this PR easier to follow. Could you please send another PR to fix the Maven failure? BTW, IIUC, this failure only affects Maven builds and probably that's why it wasn't caused by the Jenkins PR builder when testing #16751? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16791 Hope we finally have proper Parquet filter push-down for string/binary columns (cross fingers)! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16791 @dongjoon-hyun Actually, could you please point me the Maven build failure? I don't think this failure is caused by this PR, is it? Are you refering to some existing PR introduced by some earlier commit(s)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16791 @dongjoon-hyun Ah, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16791 cc @cloud-fan @rxin @HyukjinKwon --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workar...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/16791 [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/hacks due to bugs of old Parquet versions ## What changes were proposed in this pull request? We've already upgraded parquet-mr to 1.8.2. This PR does some further cleanup by removing a workaround of PARQUET-686 and a hack due to PARQUET-363 and PARQUET-278. All three Parquet issues are fixed in parquet-mr 1.8.2. ## How was this patch tested? Existing unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark parquet-1.8.2-cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16791.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16791 commit 5d97b39d2c196ba30a5eaa0c5e782f6c115668fc Author: Cheng Lian <l...@databricks.com> Date: 2017-01-23T21:05:44Z Upgrade Parquet to 1.8.2-rc1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16649: [SPARK-19295] [SQL] IsolatedClientLoader's downloadVersi...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16649 LGTM pending Jenkins. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16564: [SPARK-19065][SS]Rewrite Alias in StreamExecution if nec...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16564 LGTM as long as we decide to preserve #15427. The root cause of this issue is still the `df("col")` syntax, which is the motivation behind #15427. We decided not to deprecate/remove this syntax to ensure backward compatibility but maybe we should reconsider that decision as it keeps causing issues and confuses users? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16424: [SPARK-19016][SQL][DOC] Document scalable partition hand...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16424 OK, I'm merging this to master and branch-2.1. Thanks for the review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16424: [SPARK-19016][SQL][DOC] Document scalable partition hand...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16424 @ericl @CodingCat Thanks for the review! Fixed per your comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16424: [SPARK-19016][SQL][DOC] Document scalable partiti...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/16424#discussion_r94181109 --- Diff: docs/sql-programming-guide.md --- @@ -526,11 +526,18 @@ By default `saveAsTable` will create a "managed table", meaning that the locatio be controlled by the metastore. Managed tables will also have their data deleted automatically when a table is dropped. -Currently, `saveAsTable` does not expose an API supporting the creation of an "External table" from a `DataFrame`, -however, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key -and location of the external table as its value (String) when saving the table with `saveAsTable`. When an External table +Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`, +however. This functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key --- End diff -- No... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16424: [SPARK-19016][SQL][DOC] Document scalable partiti...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/16424 [SPARK-19016][SQL][DOC] Document scalable partition handling ## What changes were proposed in this pull request? This PR documents the scalable partition handling feature in the body of the programming guide. Before this PR, we only mention it in the migration guide. It's not super clear that external datasource tables require an extra `MSCK REPAIR TABLE` command is to have per-partition information persisted since 2.1. ## How was this patch tested? N/A. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark scalable-partition-handling-doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16424.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16424 commit 2e47d7da9251bd4d2dca6071db96f8b8a3dc6f1e Author: Cheng Lian <l...@databricks.com> Date: 2016-12-28T19:14:18Z Document scalable partition handling --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16184: [SPARK-18753][SQL] Keep pushed-down null literal as a fi...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16184 LGTM, thanks for working on this! I think the problem is that we generate a foldable predicate (`lit(null)`) during optimization phase but fail to fold it. Ideally, the optimizer should fold `IsNotNull($"_1") && lit(null)` into a `lit(false)`. But it's still good to check for foldable predicates at planning phase. Merging this to master and branch-2.1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16030 Another thing that @cloud-fan and I agreed upon is that our current `DataFrameReader.schema()` interface method is insufficient. We may want to add - `DataFrameReader.dataSchema()`, and - `DataFrameReader.partitoinSchema()` in the future to capture cases where data schema and partition schema overlap with each other. The reason is that `FileFormat.buildReader()` requires a `dataSchema` argument, which should be the full schema of the underlying data files and may contain one or more partition columns. The reason is that many data sources (e.g. CSV and LibSVM) simply don't support column pruning, and you have to read out all physical columns first and then project out unnecessary ones. Say we have a table with the following schema information: - Data schema: `[x, y, p1]` - Partition schema: `[p1, p2]` Say a user tries to read this table with a user-specified schema `[x, y, p1, p2]`. Now we've no idea whether `p1` is part of the data schema or not since we skip schema inference when a user-specified schema is provided. Although I haven't tested it extensively, I'd say it would be problematic if we try to use user-specified schema together with data sources without column pruning support (e.g. CSV) when the data schema overlaps with partition schema. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16030 @maropu @brkyvz Sorry for the delay, I was blocked by some other tasks during the last a few days. @cloud-fan and I just did some investigation and we think we came up with a minimal fix to this issue that also preserves the old behavior. Although we do agree that the new behavior makes more sense (moving all partition columns to the end of the output schema), it is a breaking change and we'd like to minimize risks for 2.1 release. @cloud-fan will give a summary of the new fix and reasons behind it soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16163: [SPARK-18730] Post Jenkins test report page instead of t...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16163 @srowen Thanks. I sent this one because the `consoleFull` page frequently freezes my browser these days, not mentioning viewing Jenkins build results via mobile phone... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16163: [SPARK-18730] Post Jenkins test report page inste...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/16163 [SPARK-18730] Post Jenkins test report page instead of the full console output page to GitHub ## What changes were proposed in this pull request? Currently, the full console output page of a Spark Jenkins PR build can be as large as several megabytes. It takes a relatively long time to load and may even freeze the browser for quite a while. This PR makes the build script to post the test report page link to GitHub instead. The test report page is way more concise and is usually the first page I'd like to check when investigating a Jenkins build failure. ## How was this patch tested? N/A. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark jenkins-test-report Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16163.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16163 commit 6aa9f34fa2abd02ae07dea5c0a404d67f7ae5998 Author: Cheng Lian <l...@databricks.com> Date: 2016-12-06T00:44:54Z Post Jenkins test report page instead of the full console output page --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16030 I'm checking whether the original behavior is consistent (do we always respect data schema column order when partition columns are included in data schema?). If not, I call it a bug and we just fix it. If yes, we may want to come up with a fix for this PR that preserves this old behavior and add a feature flag to deprecate it and enable the more preferrable behavior. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16030 @brkyvz I agree that always moving all partitioned columns to the end of the schema is more consistent and intuitive. However, users may have ordinal-dependent code like this: ```scala df.rdd.map { case Row(name: String, age: Int) => ... } ``` If we silently change the column order, these applications may break. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16156: [SPARK-18539][SQL]: tolerate pushed-down filter on non-e...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16156 Hey @xwu0226 @gatorsmile, did some investigation, and I don't think this is a bug now. Please refer to [my JIRA comment][1] for more details. [1]: https://issues.apache.org/jira/browse/SPARK-18539?focusedCommentId=15723747=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15723747 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16156: [SPARK-18539][SQL]: tolerate pushed-down filter on non-e...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16156 @xwu0226 Just tested that this issue also affects the normal Parquet reader (by setting `spark.sql.parquet.enableVectorizedReader` to `false`). That's also why #9940 couldn't take a similar approach as this one. Because `ParquetRecordReader` is a 3rd party class provided by parquet-mr. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/16156#discussion_r90973620 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java --- @@ -107,7 +107,16 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont footer = readFooter(configuration, file, range(split.getStart(), split.getEnd())); MessageType fileSchema = footer.getFileMetaData().getSchema(); FilterCompat.Filter filter = getFilter(configuration); - blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema); + try { +blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema); + } catch (IllegalArgumentException e) { +// In the case where a particular parquet files does not contain +// the column(s) in the filter, we don't do filtering at this level +// PARQUET-389 will resolve this issue in Parquet 1.9, which may be used +// by future Spark versions. This is a workaround for current Spark version. +// Also the assumption here is that the predicates will be applied later +blocks = footer.getBlocks(); --- End diff -- Thanks. We may just mention [SPARK-18140][1] here. It's for upgrading parquet-mr to 1.9. [1]: https://issues.apache.org/jira/browse/SPARK-18140 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/16156#discussion_r90972713 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -578,4 +578,66 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex // scalastyle:on nonascii } } + + test("SPARK-18539 - filtered by non-existing parquet column") { +Seq("parquet").foreach { format => + withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") { +withTempPath { path => + import testImplicits._ + Seq((1, "abc"), (2, "hello")).toDF("a", "b").write.format(format).save(path.toString) + + // user-specified schema contains nonexistent columns + val schema = StructType( +Seq(StructField("a", IntegerType), + StructField("b", StringType), + StructField("c", IntegerType))) + val readDf1 = spark.read.schema(schema).format(format).load(path.toString) + + // Read the table without any filter + checkAnswer(readDf1, Row(1, "abc", null) :: Row(2, "hello", null) :: Nil) + // Read the table with a filter on existing columns + checkAnswer(readDf1.filter("a < 2"), Row(1, "abc", null) :: Nil) + + // Read the table with a filter on nonexistent columns + checkAnswer(readDf1.filter("c < 2"), Nil) + checkAnswer(readDf1.filter("c is not null"), Nil) + + checkAnswer(readDf1.filter("c is null"), +Row(1, "abc", null) :: Row(2, "hello", null) :: Nil) + + checkAnswer(readDf1.filter("c is null and a < 2"), +Row(1, "abc", null) :: Nil) + + // With another parquet file that contains the column "c" + Seq((2, "abc", 3), (3, "hello", 4)).toDF("a", "b", "c") +.write.format(format).mode(SaveMode.Append).save(path.toString) + + // Right now, there are 2 parquet files. one with the column "c", + // the other does not have it. + val readDf2 = spark.read.schema(schema).format(format).load(path.toString) --- End diff -- Yea, what I concern is that we are using the vectorized Parquet reader by default now and may not have enough test coverage of the normal Parquet reader. Would be good to have test cases to prevent regression for the normal Parquet reader code path. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/16156#discussion_r90972121 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java --- @@ -107,7 +107,16 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont footer = readFooter(configuration, file, range(split.getStart(), split.getEnd())); MessageType fileSchema = footer.getFileMetaData().getSchema(); FilterCompat.Filter filter = getFilter(configuration); - blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema); + try { +blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema); + } catch (IllegalArgumentException e) { +// In the case where a particular parquet files does not contain +// the column(s) in the filter, we don't do filtering at this level +// PARQUET-389 will resolve this issue in Parquet 1.9, which may be used +// by future Spark versions. This is a workaround for current Spark version. +// Also the assumption here is that the predicates will be applied later --- End diff -- No matter filters are pushed down to Parquet reader or not, Spark will always apply all the filters again at a higher level to ensure that all filters are applied as expected. Spark treats data source filter push-down in a "best effort" manner. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16030 @brkyvz I also worry about the behavior change. Let me check whether the original behavior is by design or by accident. If it is a bug from the very beginning, then we should just fix it in this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16156: [SPARK-18539][SQL]: tolerate pushed-down filter on non-e...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16156 BTW, I think this PR is a cleaner fix than #9940, which introduces a temporary metadata while merging two `StructType`s and erased it in a later phase. We may want to remove the hack done in #9940 later if possible. But for now, let's make the fix as surgical as possible to lower the risk for 2.1 release. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16156: [SPARK-18539][SQL]: tolerate pushed-down filter on non-e...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16156 Actually, PR #9940 should have already fixed this issue. I'm checking why it doesn't work under 2.0.1 for 2.0.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/16156#discussion_r90969603 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -578,4 +578,66 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex // scalastyle:on nonascii } } + + test("SPARK-18539 - filtered by non-existing parquet column") { +Seq("parquet").foreach { format => + withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") { +withTempPath { path => + import testImplicits._ + Seq((1, "abc"), (2, "hello")).toDF("a", "b").write.format(format).save(path.toString) + + // user-specified schema contains nonexistent columns + val schema = StructType( +Seq(StructField("a", IntegerType), + StructField("b", StringType), + StructField("c", IntegerType))) + val readDf1 = spark.read.schema(schema).format(format).load(path.toString) + + // Read the table without any filter + checkAnswer(readDf1, Row(1, "abc", null) :: Row(2, "hello", null) :: Nil) + // Read the table with a filter on existing columns + checkAnswer(readDf1.filter("a < 2"), Row(1, "abc", null) :: Nil) + + // Read the table with a filter on nonexistent columns + checkAnswer(readDf1.filter("c < 2"), Nil) + checkAnswer(readDf1.filter("c is not null"), Nil) + + checkAnswer(readDf1.filter("c is null"), +Row(1, "abc", null) :: Row(2, "hello", null) :: Nil) + + checkAnswer(readDf1.filter("c is null and a < 2"), +Row(1, "abc", null) :: Nil) + + // With another parquet file that contains the column "c" + Seq((2, "abc", 3), (3, "hello", 4)).toDF("a", "b", "c") +.write.format(format).mode(SaveMode.Append).save(path.toString) + + // Right now, there are 2 parquet files. one with the column "c", + // the other does not have it. + val readDf2 = spark.read.schema(schema).format(format).load(path.toString) --- End diff -- In fact, we probably also want to test both the normal Parquet reader and the vectorized Parquet reader using the similar trick by setting `SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/16156#discussion_r90968974 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -578,4 +578,66 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex // scalastyle:on nonascii } } + + test("SPARK-18539 - filtered by non-existing parquet column") { +Seq("parquet").foreach { format => + withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") { +withTempPath { path => + import testImplicits._ + Seq((1, "abc"), (2, "hello")).toDF("a", "b").write.format(format).save(path.toString) + + // user-specified schema contains nonexistent columns + val schema = StructType( +Seq(StructField("a", IntegerType), + StructField("b", StringType), + StructField("c", IntegerType))) + val readDf1 = spark.read.schema(schema).format(format).load(path.toString) + + // Read the table without any filter + checkAnswer(readDf1, Row(1, "abc", null) :: Row(2, "hello", null) :: Nil) + // Read the table with a filter on existing columns + checkAnswer(readDf1.filter("a < 2"), Row(1, "abc", null) :: Nil) + + // Read the table with a filter on nonexistent columns + checkAnswer(readDf1.filter("c < 2"), Nil) + checkAnswer(readDf1.filter("c is not null"), Nil) + + checkAnswer(readDf1.filter("c is null"), +Row(1, "abc", null) :: Row(2, "hello", null) :: Nil) + + checkAnswer(readDf1.filter("c is null and a < 2"), +Row(1, "abc", null) :: Nil) + + // With another parquet file that contains the column "c" + Seq((2, "abc", 3), (3, "hello", 4)).toDF("a", "b", "c") +.write.format(format).mode(SaveMode.Append).save(path.toString) + + // Right now, there are 2 parquet files. one with the column "c", + // the other does not have it. + val readDf2 = spark.read.schema(schema).format(format).load(path.toString) --- End diff -- By default, we don't enable Parquet schema merging. Could you please reorganize this test case into something like this to have both code paths tested? ```scala Seq(true, false).foreach { merge => test("SPARK-18539 - filtered by non-existing parquet column - schema merging enabled: $merge") { // ... val readDf2 = spark.read .option("mergeSchema", merge.toString) .schema(schema) .format(format) .load(path.toString) // ... } } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/16156#discussion_r90968113 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java --- @@ -107,7 +107,16 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont footer = readFooter(configuration, file, range(split.getStart(), split.getEnd())); MessageType fileSchema = footer.getFileMetaData().getSchema(); FilterCompat.Filter filter = getFilter(configuration); - blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema); + try { +blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema); + } catch (IllegalArgumentException e) { +// In the case where a particular parquet files does not contain +// the column(s) in the filter, we don't do filtering at this level +// PARQUET-389 will resolve this issue in Parquet 1.9, which may be used +// by future Spark versions. This is a workaround for current Spark version. +// Also the assumption here is that the predicates will be applied later +blocks = footer.getBlocks(); --- End diff -- Could you please add a TODO item int the comment so that we won't forget to remove this after upgrading to 1.9? Thanks! ``` // TODO Remove this hack after upgrading to parquet-mr 1.9. ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16106: [SPARK-17213][SQL] Disable Parquet filter push-down for ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16106 The last build failure doesn't seem to be relevant. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16106: [SPARK-17213][SQL] Disable Parquet filter push-down for ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16106 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16106: [SPARK-17213][SQL] Disable Parquet filter push-do...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/16106 [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686 ## What changes were proposed in this pull request? Due to PARQUET-686, Parquet doesn't do string comparison correctly while doing filter push-down for string columns. This PR disables filter push-down for both string and binary columns to work around this issue. Binary columns are also affected because some Parquet data models (like Hive) may store string columns as a plain Parquet `binary` instead of a `binary (UTF8)`. ## How was this patch tested? New test case added in `ParquetFilterSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-17213-bad-string-ppd Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16106.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16106 commit d7162bfdb2a4389ea08cbb0d3627a429de6d7c45 Author: Cheng Lian <l...@databricks.com> Date: 2016-12-01T19:52:20Z Add test case commit 810c1533ae9ba959e9696a806b22e19f8d0b5c31 Author: Cheng Lian <l...@databricks.com> Date: 2016-12-01T19:54:09Z Disable Parquet filter push-down for string and binary columns commit ce71cca44009d2f8508e02291ac6872b680cd28a Author: Cheng Lian <l...@databricks.com> Date: 2016-12-01T19:59:37Z Test more string comparison operators --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16030 My hunch is that we somehow passed a wrong requested schema containing the partition column down to the vectorized Parquet reader. IIRC, we prune partition columns from the data schema when generating the requested schema for the underlying reader since partition values are directly available in the directory path, there's no need to read and decode them from the physical file. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16030 @maropu I tried your snippet (with minor modifications). It works for 1.6.0 instead of 2.0.2: ```scala case class A(a: Long, b: Int) val as = Seq(A(1, 2)) val path = "/tmp/part" sqlContext.createDataFrame(as).write.mode("overwrite").parquet(s"$path/a=1/") val df = sqlContext.read.parquet(path) df.printSchema() df.collect() ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/15979 Also backported to branch-2.1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/15979 @rxin Shall we backport this to branch-2.1? I think it's relatively safe. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/15979 Merging to master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16030 @brkyvz @maropu Actually, we do allow users to create partitioned tables that allow data schema to contains (part of) the partition columns, and there are [test][1] [cases][2] for this use case. This use case is mostly useful when you are trying to reorganize an existing dataset into a partitioned form. Say you have a JSON dataset containing all the tweets in 2016 and you'd like to partition it by date. By allowing the data schema to contain partitioned columns, you may simply put JSON files of the same date into the same directory. Otherwise, you'll have to run an ETL job to erase the date column from the dataset, which can be time-consuming. As for the query @maropu mentioned in the PR description, the query itself is problematic, because it lacks a user-specified schema to override the data type of the partitioned column `a`. Ideally, partition discovery should be able to fill in the correct data type `LongType`, but it's impossible since the directory path doesn't really expose that information. That's why a user-specified schema is necessary. This query works in 2.0.2 because of the bug @brkyvz fixed: Spark ignores the data types of partitioned columns specified in the user-specified schema. Now the bug is fixed and this query doesn't work as expected. In short: 1. This isn't a regression, the original query itself is problematic. 2. For this PR, we can either just close it or try to provide a better error message in the read path (ask the user to provide a user-specified schema) when: - A partitioned columns `p` also appears in the data schema, and - The discovered data type of `p` is different from the data type specified in the data schema. An alternative is to override the discovered partition column data type using the one in the data schema if any. But I'd say this change is probably too risky at this moment for 2.1. [1]: https://github.com/apache/spark/blob/c51c7725944d60738e2bac3e11f6aea74812905c/sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala#L44-L66 [2]: https://github.com/apache/spark/blob/c51c7725944d60738e2bac3e11f6aea74812905c/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala#L43-L66 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/15979 Good to merge pending Jenkins. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15976: [SPARK-18403][SQL] Fix unsafe data false sharing issue i...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/15976 @cloud-fan @dongjoon-hyun Thanks for the review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16061: [SPARK-18278] [Scheduler] Support native submission of s...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/16061 @erikerlandson For the RAT failure, you may either add Apache license header to newly added files or add the file to `dev/.rat-excludes`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/15979 My only concern is that "non-flat" type is neither intuitive nor a well-known term. In fact, this PR only prevents `Option[T <: Product]` to be top-level Dataset types. How about just call them "Product" types? Otherwise LGTM. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/15979 LGTM, merging to master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org