[GitHub] spark pull request #23152: [SPARK-26181][SQL] the `hasMinMaxStats` method of...

2018-11-29 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/23152#discussion_r237719791
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -879,13 +879,13 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) {
   }
 
   def hasCountStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasCountStats)
 
   def hasDistinctCount(a: Attribute): Boolean =
-    get(a).map(_.distinctCount.isDefined).getOrElse(false)
+    get(a).exists(_.distinctCount.isDefined)
 
   def hasMinMaxStats(a: Attribute): Boolean =
-    get(a).map(_.hasCountStats).getOrElse(false)
+    get(a).exists(_.hasMinMaxStats)
--- End diff --

@viirya This issue is hard to trigger because the count, distinct count, and min/max statistics usually appear together, and therefore `hasMinMaxStats` always returns the right result despite delegating to `hasCountStats`.

A numeric column whose values are all null is sufficient to reproduce this, since in that case we have the count and distinct count statistics but no min/max statistics.
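
A minimal reproduction sketch (the table name `t` is made up, and this assumes a spark-shell session with CBO statistics enabled; not taken from the PR):

```scala
// A numeric column containing only nulls: after ANALYZE, the column has count and
// distinct-count statistics but no min/max statistics, so the old hasMinMaxStats
// check (which delegated to hasCountStats) would wrongly report true.
spark.range(10).selectExpr("CAST(null AS INT) AS c").write.saveAsTable("t")
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS c")
spark.sql("SET spark.sql.cbo.enabled=true")
// Filter estimation on `c` exercises ColumnStatsMap.hasMinMaxStats for this column.
spark.sql("SELECT * FROM t WHERE c > 0").explain(true)
```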


---




[GitHub] spark issue #21879: [SPARK-24927][BUILD][BRANCH-2.3] The scope of snappy-jav...

2018-07-26 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/21879
  
@cloud-fan I didn't try to actually reproduce this issue in branches other than branch-2.3, but judging from the POM files, this issue has existed since at least 1.6.


---




[GitHub] spark issue #21879: [SPARK-24927][BUILD][BRANCH-2.3] The scope of snappy-jav...

2018-07-26 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/21879
  
retest this please


---




[GitHub] spark issue #21879: [SPARK-24927][BUILD][BRANCH-2.3] The scope of snappy-jav...

2018-07-26 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/21879
  
retest this please


---




[GitHub] spark issue #21879: [SPARK-24927][BUILD][BRANCH-2.3] The scope of snappy-jav...

2018-07-26 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/21879
  
The Kafka DStream test failures seem flaky and unrelated to this change.


---




[GitHub] spark pull request #21879: [SPARK-24927] The scope of snappy-java cannot be ...

2018-07-26 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/21879

[SPARK-24927] The scope of snappy-java cannot be "provided"

Please see [SPARK-24927][1] for more details.

[1]: https://issues.apache.org/jira/browse/SPARK-24927


Manually tested.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark spark-24927

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21879.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21879


commit 93c34da713136eb7b4ed8bb8775353c8219efa22
Author: Cheng Lian 
Date:   2018-07-26T07:01:53Z

The scope of snappy-java cannot be "provided"




---




[GitHub] spark issue #21865: [SPARK-24895] Remove spotbugs plugin

2018-07-24 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/21865
  
test this please


---




[GitHub] spark issue #21865: [SPARK-24895] Remove spotbugs plugin

2018-07-24 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/21865
  
add to whitelist


---




[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-10 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/20174
  
LGTM, merging to master and branch-2.3. Thanks!


---




[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...

2018-01-10 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160770992
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -666,4 +665,16 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
       assert(exchangePlans.length == 1)
     }
   }
+
+  Seq(true, false).foreach { codegen =>
+    test("SPARK-22951: dropDuplicates on empty data frames should produce correct aggregate" +
+      s" results when codegen enabled: $codegen") {
+      withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, codegen.toString)) {
+        assert(Seq.empty[Int].toDF("a").count() == 0)
+        assert(Seq.empty[Int].toDF("a").agg(count("*")).count() == 1)
+        assert(spark.emptyDataFrame.dropDuplicates().count() == 0)
+        assert(spark.emptyDataFrame.dropDuplicates().agg(count("*")).count() == 1)
--- End diff --

@liufengdb Maybe also add assertions to confirm that explicit global 
aggregations (by providing zero grouping keys) still return one row? For 
example:

```scala
val emptyAgg = Map.empty[String, String]

checkAnswer(
  spark.emptyDataFrame.agg(emptyAgg),
  Seq(Row())
)

checkAnswer(
  spark.emptyDataFrame.groupBy().agg(emptyAgg),
  Seq(Row())
)
```



---




[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-10 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/20174
  
@mgaido91 We can't because we do not know whether there are any input rows 
or not. For example:

```scala
val df1 = spark.range(10).select()
val df2 = spark.range(10).filter($"id" < 0).select()
val df3 = df1.dropDuplicates()
val df4 = df2.dropDuplicates()
```

`df1` has zero columns and ten rows, while `df2` has zero columns and zero rows. Therefore, `df3` should return one row containing zero columns, while `df4` should return zero rows.
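
To restate the expected behavior as assertions (a sketch, not taken from the PR's tests):

```scala
// Distinct over ten zero-column rows collapses to a single (empty) row,
// while distinct over zero rows yields no rows at all.
assert(df3.count() == 1)
assert(df4.count() == 0)
```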


---




[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...

2018-01-10 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160612592
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
           Alias(new First(attr).toAggregateExpression(), attr.name)(attr.exprId)
         }
       }
-      Aggregate(keys, aggCols, child)
+      // SPARK-22951: the implementation of aggregate operator treats the cases with and without
+      // grouping keys differently, when there are not input rows. For the aggregation after
+      // `dropDuplicates()` on an empty data frame, a grouping key is added here to make sure the
+      // aggregate operator can work correctly (returning an empty iterator).
--- End diff --

> SPARK-22951: Physical aggregate operators distinguish global aggregation from grouping aggregation by checking the number of grouping keys. The key difference is that a global aggregation always returns at least one row even if there are no input rows. Here we append a literal when the grouping key list is empty so that the resulting aggregate operator is properly treated as a grouping aggregation.


---




[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...

2018-01-10 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160611832
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1221,7 +1221,12 @@ object ReplaceDeduplicateWithAggregate extends Rule[LogicalPlan] {
           Alias(new First(attr).toAggregateExpression(), attr.name)(attr.exprId)
         }
       }
-      Aggregate(keys, aggCols, child)
+      // SPARK-22951: the implementation of aggregate operator treats the cases with and without
+      // grouping keys differently, when there are not input rows. For the aggregation after
+      // `dropDuplicates()` on an empty data frame, a grouping key is added here to make sure the
+      // aggregate operator can work correctly (returning an empty iterator).
+      val newKeys = if (keys.isEmpty) Literal(1) :: Nil else keys
--- End diff --

Nit: Maybe rename `newKeys` to `nonemptyKeys`.


---




[GitHub] spark pull request #20174: [SPARK-22951][SQL] fix aggregation after dropDupl...

2018-01-09 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/20174#discussion_r160602821
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -19,6 +19,8 @@ package org.apache.spark.sql
 
 import scala.util.Random
 
+import org.apache.spark.sql.catalyst.expressions.Literal
+import org.apache.spark.sql.catalyst.expressions.aggregate.Percentile
--- End diff --

Remove unused imports?


---




[GitHub] spark issue #20174: [SPARK-22951][SQL] fix aggregation after dropDuplicates ...

2018-01-09 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/20174
  
@liufengdb I wrote a summary according to our offline discussion to explain 
the subtle change made in this PR. Please feel free to use it in the PR 
description if it looks good to you :)



Like all other SQL query engines, Spark SQL supports both grouping aggregation, e.g.:

```sql
SELECT count(*) FROM a_table GROUP BY key
```

and global aggregation, e.g.:

```sql
SELECT count(*) FROM a_table
```

Essentially, global aggregation is a special case of grouping aggregation 
where the whole table is treated as a single group. A key difference, though, 
exists in the case where a group contains zero rows:

- Grouping aggregation

  ```sql
  SELECT count(*) AS c FROM range(3) WHERE id < 0 GROUP BY id
  -- +---+
  -- | c |
  -- +---+
  -- +---+
  ```

  The above query returns zero rows.

- Global aggregation

  ```sql
  SELECT count(*) AS c FROM range(3) WHERE id < 0
  -- +---+
  -- | c |
  -- +---+
  -- | 0 |
  -- +---+
  ```

  The above query returns one row. To be more specific, a global aggregation with zero input rows always returns a single row containing the initial aggregation state as the output.

To tell whether an `Aggregate` operator `A` is a global aggregation or not, 
Spark SQL simply checks the number of grouping keys, and `A` is a global 
aggregation if it has zero grouping keys.

However, this simple principle drops the ball in the following case:

```scala
spark.emptyDataFrame.dropDuplicates().agg(count($"*") as "c").show()
// +---+
// | c |
// +---+
// | 1 |
// +---+
```

The reason is that:

1.  `df.dropDuplicates()` is roughly translated into something equivalent 
to:

```
val allColumns = df.columns.map { col }
df.groupBy(allColumns: _*).agg(allColumns.head, allColumns.tail: _*)
```

This translation is implemented in the rule 
`ReplaceDeduplicateWithAggregate`.

2.  `spark.emptyDataFrame` contains zero columns and zero rows.

Therefore, rule `ReplaceDeduplicateWithAggregate` translates 
`spark.emptyDataFrame.dropDuplicates()` into something equivalent to:

```scala
spark.emptyDataFrame.groupBy().agg(Map.empty[String, String])
```

This confuses Spark SQL: the operator is recognized as a global aggregation because it contains no grouping keys. As a result, Spark SQL allocates a single row filled with the initial aggregation state, uses it as the output, and returns a wrong result.

To fix this issue, this PR tweaks `ReplaceDeduplicateWithAggregate` by 
appending a literal `1` to the grouping key list of the resulting `Aggregate` 
operator when the input plan contains zero output columns. In this way, 
`spark.emptyDataFrame.dropDuplicates()` is now translated into:

```scala
spark.emptyDataFrame.dropDuplicates()
=> spark.emptyDataFrame.groupBy(lit(1)).agg(Map.empty[String, String])
```

This is now properly treated as a grouping aggregation and returns the correct answer.
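
For reference, a sketch of the expected result after the fix (not output copied from the PR):

```scala
spark.emptyDataFrame.dropDuplicates().agg(count($"*") as "c").show()
// dropDuplicates() now correctly returns zero rows, so the outer global
// aggregation counts zero input rows and returns a single row with 0:
// +---+
// | c |
// +---+
// | 0 |
// +---+
```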



---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-11-13 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/19439
  
@jkbradley I'm not confident enough about this part, but a quick check suggested that `PathFilter`s are typically used in `FileInputFormat.listStatus()`, which is usually called from the `FileInputFormat.getSplits()` method, and `getSplits()` is used by Spark to determine RDD partitions on the driver side. In that specific, typical scenario, the behavior of `SamplePathFilter` should be deterministic. However, I'd say this assumption is fragile, since `PathFilter`s are used in a pretty ad-hoc way throughout the Hadoop ecosystem, and my impression is that `PathFilter`s are generally expected to be deterministic.
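
For contrast, a minimal sketch of a `PathFilter` that satisfies that determinism expectation (not Spark's `SamplePathFilter`; the configuration key is made up):

```scala
import org.apache.hadoop.conf.{Configurable, Configuration}
import org.apache.hadoop.fs.{Path, PathFilter}

class SuffixPathFilter extends PathFilter with Configurable {
  private var conf: Configuration = _
  private var suffix: String = ".jpg"

  override def setConf(c: Configuration): Unit = {
    conf = c
    // Hypothetical configuration key used only for this sketch.
    suffix = c.get("example.pathfilter.suffix", ".jpg")
  }

  override def getConf: Configuration = conf

  // Deterministic: the decision depends only on the path, never on random state,
  // so it returns the same answer no matter how often or where it is invoked.
  override def accept(path: Path): Boolean = path.getName.endsWith(suffix)
}
```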



---




[GitHub] spark issue #19386: [SPARK-22161] [SQL] Add Impala-modified TPC-DS queries

2017-09-29 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/19386
  
LGTM


---




[GitHub] spark pull request #19361: [SPARK-22140] Add TPCDSQuerySuite

2017-09-27 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19361#discussion_r141501351
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala ---
@@ -0,0 +1,348 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.sql.catalyst.util.resourceToString
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class TPCDSQuerySuite extends QueryTest with SharedSQLContext with 
BeforeAndAfterAll {
+
+  /**
+   * Drop all the tables
+   */
+  protected override def afterAll(): Unit = {
+try {
+  spark.sessionState.catalog.reset()
+} finally {
+  super.afterAll()
+}
+  }
+
+  override def beforeAll() {
+super.beforeAll()
+sql(
+  """
+|CREATE TABLE `catalog_page` (
+|`cp_catalog_page_sk` INT, `cp_catalog_page_id` STRING, 
`cp_start_date_sk` INT,
+|`cp_end_date_sk` INT, `cp_department` STRING, `cp_catalog_number` 
INT,
+|`cp_catalog_page_number` INT, `cp_description` STRING, `cp_type` 
STRING)
+|USING parquet
+  """.stripMargin)
+
+sql(
+  """
+|CREATE TABLE `catalog_returns` (
+|`cr_returned_date_sk` INT, `cr_returned_time_sk` INT, 
`cr_item_sk` INT,
+|`cr_refunded_customer_sk` INT, `cr_refunded_cdemo_sk` INT, 
`cr_refunded_hdemo_sk` INT,
+|`cr_refunded_addr_sk` INT, `cr_returning_customer_sk` INT, 
`cr_returning_cdemo_sk` INT,
+|`cr_returning_hdemo_sk` INT, `cr_returning_addr_sk` INT, 
`cr_call_center_sk` INT,
+|`cr_catalog_page_sk` INT, `cr_ship_mode_sk` INT, 
`cr_warehouse_sk` INT, `cr_reason_sk` INT,
+|`cr_order_number` INT, `cr_return_quantity` INT, 
`cr_return_amount` DECIMAL(7,2),
+|`cr_return_tax` DECIMAL(7,2), `cr_return_amt_inc_tax` 
DECIMAL(7,2), `cr_fee` DECIMAL(7,2),
+|`cr_return_ship_cost` DECIMAL(7,2), `cr_refunded_cash` 
DECIMAL(7,2),
+|`cr_reversed_charge` DECIMAL(7,2), `cr_store_credit` DECIMAL(7,2),
+|`cr_net_loss` DECIMAL(7,2))
+|USING parquet
+  """.stripMargin)
+
+sql(
+  """
+|CREATE TABLE `customer` (
+|`c_customer_sk` INT, `c_customer_id` STRING, `c_current_cdemo_sk` 
INT,
+|`c_current_hdemo_sk` INT, `c_current_addr_sk` INT, 
`c_first_shipto_date_sk` INT,
+|`c_first_sales_date_sk` INT, `c_salutation` STRING, 
`c_first_name` STRING,
+|`c_last_name` STRING, `c_preferred_cust_flag` STRING, 
`c_birth_day` INT,
+|`c_birth_month` INT, `c_birth_year` INT, `c_birth_country` 
STRING, `c_login` STRING,
+|`c_email_address` STRING, `c_last_review_date` STRING)
+|USING parquet
+  """.stripMargin)
+
+sql(
+  """
+|CREATE TABLE `customer_address` (
+|`ca_address_sk` INT, `ca_address_id` STRING, `ca_street_number` 
STRING,
+|`ca_street_name` STRING, `ca_street_type` STRING, 
`ca_suite_number` STRING,
+|`ca_city` STRING, `ca_county` STRING, `ca_state` STRING, `ca_zip` 
STRING,
+|`ca_country` STRING, `ca_gmt_offset` DECIMAL(5,2), 
`ca_location_type` STRING)
+|USING parquet
+  """.stripMargin)
+
+sql(
+  """
+|CREATE TABLE `customer_demographics` (
+|`cd_demo_sk` INT, `cd_gender` STRING, `cd_marital_status` STRING,
+|`cd_education_status` STRING, `cd_purchase_estimate` INT, 
`cd_credit_rating` STRING,
+|`cd_dep_count` INT, `cd_dep_employed_count` INT, 
`cd_dep_college_count` INT)
+|USING parquet
+  """.stripMargin)
+
+sq

[GitHub] spark pull request #19080: [SPARK-21865][SQL] simplify the distribution sema...

2017-08-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19080#discussion_r136243745
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
@@ -30,18 +30,43 @@ import org.apache.spark.sql.types.{DataType, IntegerType}
  *  - Intra-partition ordering of data: In this case the distribution describes guarantees made
  *    about how tuples are distributed within a single partition.
  */
-sealed trait Distribution
+sealed trait Distribution {
+  /**
+   * The required number of partitions for this distribution. If it's None, then any number of
+   * partitions is allowed for this distribution.
+   */
+  def requiredNumPartitions: Option[Int]
+
+  /**
+   * Creates a default partitioning for this distribution, which can satisfy this distribution while
+   * matching the given number of partitions.
+   */
+  def createPartitioning(numPartitions: Int): Partitioning
+}
 
 /**
  * Represents a distribution where no promises are made about co-location of data.
 */
-case object UnspecifiedDistribution extends Distribution
+case object UnspecifiedDistribution extends Distribution {
+  override def requiredNumPartitions: Option[Int] = None
+
+  override def createPartitioning(numPartitions: Int): Partitioning = {
+    throw new IllegalStateException("UnspecifiedDistribution does not have default partitioning.")
+  }
+}
 
 /**
 * Represents a distribution that only has a single partition and all tuples of the dataset
 * are co-located.
 */
-case object AllTuples extends Distribution
+case object AllTuples extends Distribution {
--- End diff --

Yea, I did miss the point that `ClusteredDistribution` covers both 
`RangePartitioning` and `HashPartitioning`.


---



[GitHub] spark pull request #19080: [SPARK-21865][SQL] simplify the distribution sema...

2017-08-30 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19080#discussion_r136211685
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala ---
@@ -30,18 +30,43 @@ import org.apache.spark.sql.types.{DataType, IntegerType}
  *  - Intra-partition ordering of data: In this case the distribution describes guarantees made
  *    about how tuples are distributed within a single partition.
  */
-sealed trait Distribution
+sealed trait Distribution {
+  /**
+   * The required number of partitions for this distribution. If it's None, then any number of
+   * partitions is allowed for this distribution.
+   */
+  def requiredNumPartitions: Option[Int]
+
+  /**
+   * Creates a default partitioning for this distribution, which can satisfy this distribution while
+   * matching the given number of partitions.
+   */
+  def createPartitioning(numPartitions: Int): Partitioning
+}
 
 /**
  * Represents a distribution where no promises are made about co-location of data.
 */
-case object UnspecifiedDistribution extends Distribution
+case object UnspecifiedDistribution extends Distribution {
+  override def requiredNumPartitions: Option[Int] = None
+
+  override def createPartitioning(numPartitions: Int): Partitioning = {
+    throw new IllegalStateException("UnspecifiedDistribution does not have default partitioning.")
+  }
+}
 
 /**
 * Represents a distribution that only has a single partition and all tuples of the dataset
 * are co-located.
 */
-case object AllTuples extends Distribution
+case object AllTuples extends Distribution {
--- End diff --

I think the concern about logical/physical separation is only useful if we take CBO into consideration. E.g., for an `AllTuples` distribution requirement, the planner may produce two plans using `SinglePartition` and `BroadcastPartitioning` respectively and pick the cheaper one. Within the scope of our current planner framework, though, this separation doesn't seem to be very useful.


---



[GitHub] spark issue #18712: [SPARK-17528][SQL][followup] remove unnecessary data cop...

2017-07-24 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/18712
  
Nice, I didn't know that the copy issue had already been fixed.

LGTM, merging to master.


---



[GitHub] spark issue #17419: [SPARK-19634][ML] Multivariate summarizer - dataframes A...

2017-07-20 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17419
  
@WeichenXu123 and I did some profiling using `jvisualvm` and found that 40% of the time is spent in the copy performed by [this `safeProjection`][1]. This copy is a known issue; it was introduced to fight the false sharing issue @cloud-fan and I hit before.

@cloud-fan tried to fix this issue in #15082, but that PR didn't work out due to some other concerns (I can't remember all the details now).

@cloud-fan, any ideas about improving `ObjectHashAggregateExec` (e.g. 
adding code generation support)?

[1]: 
https://github.com/apache/spark/blob/03367d7aa3acd3abbcacba57ea75c8efa2a9a794/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/ObjectAggregationIterator.scala#L154


---



[GitHub] spark issue #18470: [SPARK-21258][SQL] Fix WindowExec complex object aggrega...

2017-06-29 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/18470
  
LGTM pending Jenkins.


---



[GitHub] spark pull request #18412: [SPARK-21203] [SQL] Fix wrong results of insertio...

2017-06-26 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/18412#discussion_r124062223
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ---
@@ -482,15 +482,15 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
       case (fromField, toField) => cast(fromField.dataType, toField.dataType)
     }
     // TODO: Could be faster?
-    val newRow = new GenericInternalRow(from.fields.length)
     buildCast[InternalRow](_, row => {
+      val newRow = new GenericInternalRow(from.fields.length)
       var i = 0
       while (i < row.numFields) {
         newRow.update(i,
           if (row.isNullAt(i)) null else castFuncs(i)(row.get(i, from.apply(i).dataType)))
         i += 1
       }
-      newRow.copy()
--- End diff --

@hvanhovell I found the semantics of the original 
`GenericMutableRow.copy()` pretty dangerous since it didn't properly implement 
deep copy semantics. In fact, it's impossible to do a proper deep copy there. 
We only copy the underlying field array but may still share field objects 
accidentally.

If we do want to "fix" `GenericInternalRow.copy()`, I'd prefer throwing an 
exception instead of following the old `GenericMutableRow.copy()` method.


---



[GitHub] spark pull request #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to ...

2017-06-02 Thread liancheng
Github user liancheng closed the pull request at:

https://github.com/apache/spark/pull/18181


---



[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1

2017-06-02 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/18181
  
@rdblue @mallman Thanks for the comments! As mentioned in the JIRA ticket, 
we've decided to preserve parquet-mr 1.8.2. Instead, we'll add a release notes 
entry to suggest using parquet-avro 1.8.1 instead of 1.8.2 to avoid the Avro 
dependency conflict.

I'm closing this PR.


---



[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1

2017-06-02 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/18181
  
retest this please


---



[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1

2017-06-02 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/18181
  
Unfortunately, rolling back parquet-mr to 1.8.1 brings back [PARQUET-389][1], which breaks multiple test cases involving schema evolution (adding a new column to a Parquet table and then filtering on that column).

I'm trying to figure out a workaround for this but haven't had any luck yet.

[1]: https://issues.apache.org/jira/browse/PARQUET-389
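
For context, a sketch of the schema-evolution pattern those tests exercise (the paths are hypothetical; PARQUET-389 concerns filters on a column that is missing from older Parquet files):

```scala
// Old files only have column `a`; newer files add column `b`.
spark.range(3).toDF("a").write.parquet("/tmp/evolved/old")
spark.range(3).selectExpr("id AS a", "id AS b").write.parquet("/tmp/evolved/new")

// Reading with schema merging and filtering on the newly added column `b`
// is the kind of query that breaks under parquet-mr 1.8.1.
spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/evolved/old", "/tmp/evolved/new")
  .filter("b > 0")
  .show()
```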


---



[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1

2017-06-02 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/18181
  
@viirya Thanks for the reminder! I'm reverting that one.


---



[GitHub] spark issue #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1

2017-06-02 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/18181
  
@dongjoon-hyun I already reverted PR #16751 manually but forgot to mention 
it in the PR description.


---



[GitHub] spark pull request #18181: [SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to ...

2017-06-01 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/18181

[SPARK-20958][SQL] Roll back parquet-mr 1.8.2 to 1.8.1

## What changes were proposed in this pull request?

This PR reverts PR #16791, #16817, and part of #16795 to roll back 
parquet-mr 1.8.2 to 1.8.1 to escape from a dependency hell caused by avro 1.7.7 
and 1.8.1.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark rollback-parquet-mr-1.8.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18181.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18181


commit d03d4de8744e8a0eb4b6002b3f4a4e98be35eab6
Author: Cheng Lian <l...@databricks.com>
Date:   2017-06-02T00:08:38Z

Roll back parquet-mr 1.8.2 to 1.8.1

commit 6dcef66617d95fdffdff71d20a228863d88f97cf
Author: Cheng Lian <l...@databricks.com>
Date:   2017-06-02T00:10:59Z

Update dependency manifests

commit c15760bb49e59a0e71971e3a37cc03ffb5ffc60a
Author: Cheng Lian <l...@databricks.com>
Date:   2017-06-02T00:15:15Z

Revert "[SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/hacks due to 
bugs of old Parquet versions"

This reverts commit 7730426cb95eec2652a9ea979ae2c4faf7e585f2.

commit cf461d08ea9dda4a3220a5aecf1c22d66d432d99
Author: Cheng Lian <l...@databricks.com>
Date:   2017-06-02T00:30:29Z

Revert "[SPARK-17213][SQL][FOLLOWUP] Re-enable Parquet filter tests for 
binary and string"

This reverts commit 0f16ff5b0ec8cd828774ba5ddb276d7b06dbe273.




---



[GitHub] spark issue #17469: [SPARK-20132][Docs] Add documentation for column string ...

2017-05-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17469
  
Also cherry-picked this to branch-2.2.

cc @gatorsmile 


---



[GitHub] spark issue #17693: [SPARK-20314][SQL] Inconsistent error handling in JSON p...

2017-04-20 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17693
  
Not suggesting doing it in this PR, but maybe adding a SQL option to let users choose the error handling strategy for all the JSON functions would make more sense here? The Spark JSON data source already lets users choose a parsing mode among the following (see the sketch after this list):

- `PERMISSIVE`: replace malformed records with nulls,
- `DROPMALFORMED`: drop malformed records, and
- `FAILFAST`: report an error and abort once a malformed record is found.
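
A sketch of how the data source reader already exposes that choice (the input path is hypothetical):

```scala
// The `mode` option selects the error handling strategy for the JSON data source;
// a similar SQL option could offer the same choice for the JSON functions.
val df = spark.read
  .option("mode", "FAILFAST")  // or "PERMISSIVE" / "DROPMALFORMED"
  .json("/path/to/input.json")
```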


---



[GitHub] spark issue #17398: [SPARK-19716][SQL] support by-name resolution for struct...

2017-04-04 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17398
  
LGTM. Merging to master. Thanks!


---



[GitHub] spark issue #17454: [SPARK-20125][SQL] Dataset of type option of map does no...

2017-03-28 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17454
  
OK, resolved the conflict manually and merged to branch-2.1.


---



[GitHub] spark issue #17454: [SPARK-20125][SQL] Dataset of type option of map does no...

2017-03-28 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17454
  
This PR conflicts with branch-2.1, trying to resolve manually while merging.


---



[GitHub] spark issue #17454: [SPARK-20125][SQL] Dataset of type option of map does no...

2017-03-28 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17454
  
LGTM, merging to master and branch-2.1. Thanks!


---



[GitHub] spark pull request #16917: [SPARK-19529][BRANCH-1.6] Backport PR #16866 to b...

2017-03-13 Thread liancheng
Github user liancheng closed the pull request at:

https://github.com/apache/spark/pull/16917


---



[GitHub] spark pull request #17247: [SPARK-19905][SQL] Bring back Dataset.inputFiles ...

2017-03-10 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/17247#discussion_r105489225
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala ---
@@ -1865,4 +1865,12 @@ class HiveDDLSuite
       }
     }
   }
+
+  test("SPARK-19905: Hive SerDe table input paths") {
+    withTable("spark_19905") {
+      spark.range(10).createOrReplaceTempView("spark_19905_view")
+      sql("CREATE TABLE spark_19905 STORED AS RCFILE AS SELECT * FROM spark_19905_view")
+      assert(spark.table("spark_19905").inputFiles.nonEmpty)
--- End diff --

`input_file_name` and `Dataset.inputFiles` are different code paths.


---



[GitHub] spark pull request #17247: [SPARK-19905][SQL] Bring back Dataset.inputFiles ...

2017-03-10 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/17247#discussion_r105475257
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2734,6 +2735,8 @@ class Dataset[T] private[sql](
         fsBasedRelation.inputFiles
       case fr: FileRelation =>
         fr.inputFiles
+      case r: CatalogRelation if DDLUtils.isHiveTable(r.tableMeta) =>
+        r.tableMeta.storage.locationUri.map { _.toString }.toArray
--- End diff --

I intentionally didn't use `r.tableMeta.location` here, for safety.


---



[GitHub] spark issue #17247: [SPARK-19905][SQL] Bring back Dataset.inputFiles for Hiv...

2017-03-10 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17247
  
cc @cloud-fan 


---



[GitHub] spark pull request #17247: [SPARK-19905][SQL] Bring back Dataset.inputFiles ...

2017-03-10 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/17247

[SPARK-19905][SQL] Bring back Dataset.inputFiles for Hive SerDe tables

## What changes were proposed in this pull request?

`Dataset.inputFiles` works by matching `FileRelation`s in the query plan. 
In Spark 2.1, Hive SerDe tables are represented by `MetastoreRelation`, which 
inherits from `FileRelation`. However, in Spark 2.2, Hive SerDe tables are now 
represented by `CatalogRelation`, which doesn't inherit from `FileRelation` 
anymore, due to the unification of Hive SerDe tables and data source tables. 
This change breaks `Dataset.inputFiles` for Hive SerDe tables.

This PR tries to fix this issue by explicitly matching `CatalogRelation`s 
that are Hive SerDe tables in `Dataset.inputFiles`. Note that we can't make 
`CatalogRelation` inherit from `FileRelation` since not all `CatalogRelation`s 
are file based (e.g., JDBC data source tables).

## How was this patch tested?

New test case added in `HiveDDLSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark 
spark-19905-hive-table-input-files

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17247


commit 3e0abc48de8219e3daf839584ec874855ced1210
Author: Cheng Lian <l...@databricks.com>
Date:   2017-03-10T19:42:17Z

Bring back Dataset.inputFiles for Hive SerDe tables




---



[GitHub] spark issue #17168: [SPARK-19737][SQL] New analysis rule for reporting unreg...

2017-03-06 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17168
  
Thanks for the review! Merging to master.


---



[GitHub] spark issue #17168: [SPARK-19737][SQL] New analysis rule for reporting unreg...

2017-03-06 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17168
  
@cloud-fan I'm afraid that's an even more complicated approach, since you'd need to turn function builders, which are now simply regular Scala functions, into unresolved expressions.


---



[GitHub] spark issue #17168: [SPARK-19737][SQL] New analysis rule for reporting unreg...

2017-03-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/17168
  
cc @cloud-fan @rxin


---



[GitHub] spark pull request #17168: [SPARK-19737][SQL] New analysis rule for reportin...

2017-03-05 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/17168

[SPARK-19737][SQL] New analysis rule for reporting unregistered functions 
without relying on relation resolution

## What changes were proposed in this pull request?

This PR adds a new `Once` analysis rule batch that consists of a single analysis rule, `LookupFunctions`, which performs a simple existence check over `UnresolvedFunction`s without actually resolving them.

The benefit of this rule is that it doesn't require function arguments to be resolved first and therefore doesn't rely on relation resolution, which may incur potentially expensive partition/schema discovery costs.

Please refer to [SPARK-19737][1] for more details about the motivation.

## How was this patch tested?

New test case added in `AnalysisErrorSuite`.
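
A sketch of the user-visible effect (the function name is made up, and the exact error message may differ):

```scala
spark.range(10).createOrReplaceTempView("t")
// With the LookupFunctions batch, the unregistered function is reported during
// analysis by a simple existence check, before full expression/relation resolution.
// Expected: an AnalysisException about an undefined function.
spark.sql("SELECT not_a_registered_function(id) FROM t")
```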

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark spark-19737-lookup-functions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17168.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17168


commit c9da94ce0d15333f30653f67c6a1af6bd0f10589
Author: Cheng Lian <l...@databricks.com>
Date:   2017-03-05T06:16:26Z

New analysis rule for detect undefined functions more quickly




---



[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test

2017-02-15 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16935
  
Merging to master and branch-2.1.


---



[GitHub] spark issue #16917: [SPARK-19529][BRANCH-1.6] Backport PR #16866 to branch-1...

2017-02-14 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16917
  
cc @JoshRosen


---



[GitHub] spark issue #16866: [SPARK-19529] TransportClientFactory.createClient() shou...

2017-02-13 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16866
  
Merged to master, branch-2.1, and branch-2.0.

The files involved had been moved to new directories in branch-1.6, which made it hard to cherry-pick. Created PR #16917 to backport this one to 1.6.


---



[GitHub] spark pull request #16917: [SPARK-19529][BRANCH-1.6] Backport PR #16866 to b...

2017-02-13 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/16917

[SPARK-19529][BRANCH-1.6] Backport PR #16866 to branch-1.6

## What changes were proposed in this pull request?

This PR backports PR #16866 to branch-1.6

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark spark-19529-1.6-backport

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16917.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16917


commit f7d7c54a6b0316d66dd03b3d712df7d33f7cf5b9
Author: Cheng Lian <l...@databricks.com>
Date:   2017-02-13T22:12:42Z

Backport PR #16866 to branch-1.6




---



[GitHub] spark issue #16866: [SPARK-19529] TransportClientFactory.createClient() shou...

2017-02-13 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16866
  
Branch-2.1 test compilation happens to be broken right now.

Trying to fix the compilation failure first.


---



[GitHub] spark issue #16866: [SPARK-19529] TransportClientFactory.createClient() shou...

2017-02-13 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16866
  
Merging to master, branch-2.1, branch-2.0, and branch-1.6.

Conflicts occurred, I'm fixing them manually.


---



[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2017-02-10 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16161
  
Thanks. Backported to branch-2.1.


---



[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2017-02-10 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16161
  
Shall we backport this to branch-2.1? I'd consider this a bug because the 
following snippet fails in Spark 2.1:

```scala
case class Wrapper1(value: Option[Map[String, String]])
case class Wrapper2(value: Map[String, String])

val ds1 = Seq.empty[Wrapper1].toDS()  // Fail
val ds2 = Seq.empty[Wrapper2].toDS()  // Succeed
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16795: [SPARK-19409][BUILD] Fix ParquetAvroCompatibilitySuite f...

2017-02-06 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16795
  
@dongjoon-hyun, you may add `[TEST-MAVEN]` in the PR title to ask Jenkins 
to test this PR using Maven.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

2017-02-06 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16791
  
@HyukjinKwon Sorry that I didn't see your comment before this PR got 
merged. I believe PARQUET-686 had already been fixed by apache/parquet-mr#367 
but wasn't marked as resolved in JIRA. Thanks for sending out #16817 for 
re-enabling the tests!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

2017-02-03 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16791
  
I see, thanks for the context. But I'd like to keep this Maven build 
failure fix in a separate PR so that people can easily cherry-pick the fix. 
Also, it helps to keep this PR easier to follow. Could you please send another 
PR to fix the Maven failure?

BTW, IIUC, this failure only affects Maven builds, and that's probably why 
it wasn't caught by the Jenkins PR builder when testing #16751?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

2017-02-03 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16791
  
Hope we finally have proper Parquet filter push-down for string/binary 
columns (fingers crossed)!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

2017-02-03 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16791
  
@dongjoon-hyun Actually, could you please point me to the Maven build failure? 
I don't think this failure is caused by this PR, is it? Are you referring to 
an existing failure introduced by some earlier commit(s)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

2017-02-03 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16791
  
@dongjoon-hyun Ah, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/h...

2017-02-03 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16791
  
cc @cloud-fan @rxin @HyukjinKwon 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16791: [SPARK-19409][SPARK-17213] Cleanup Parquet workar...

2017-02-03 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/16791

[SPARK-19409][SPARK-17213] Cleanup Parquet workarounds/hacks due to bugs of 
old Parquet versions

## What changes were proposed in this pull request?

We've already upgraded parquet-mr to 1.8.2. This PR does some further 
cleanup by removing a workaround of PARQUET-686 and a hack due to PARQUET-363 
and PARQUET-278. All three Parquet issues are fixed in parquet-mr 1.8.2.

## How was this patch tested?

Existing unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark parquet-1.8.2-cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16791.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16791


commit 5d97b39d2c196ba30a5eaa0c5e782f6c115668fc
Author: Cheng Lian <l...@databricks.com>
Date:   2017-01-23T21:05:44Z

Upgrade Parquet to 1.8.2-rc1




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16649: [SPARK-19295] [SQL] IsolatedClientLoader's downloadVersi...

2017-01-19 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16649
  
LGTM pending Jenkins.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16564: [SPARK-19065][SS]Rewrite Alias in StreamExecution if nec...

2017-01-12 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16564
  
LGTM as long as we decide to preserve #15427.

The root cause of this issue is still the `df("col")` syntax, which is the 
motivation behind #15427. We decided not to deprecate/remove this syntax to 
ensure backward compatibility, but maybe we should reconsider that decision, 
as it keeps causing issues and confusing users?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16424: [SPARK-19016][SQL][DOC] Document scalable partition hand...

2016-12-30 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16424
  
OK, I'm merging this to master and branch-2.1. Thanks for the review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16424: [SPARK-19016][SQL][DOC] Document scalable partition hand...

2016-12-29 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16424
  
@ericl @CodingCat Thanks for the review! Fixed per your comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16424: [SPARK-19016][SQL][DOC] Document scalable partiti...

2016-12-29 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16424#discussion_r94181109
  
--- Diff: docs/sql-programming-guide.md ---
@@ -526,11 +526,18 @@ By default `saveAsTable` will create a "managed 
table", meaning that the locatio
 be controlled by the metastore. Managed tables will also have their data 
deleted automatically
 when a table is dropped.
 
-Currently, `saveAsTable` does not expose an API supporting the creation of 
an "External table" from a `DataFrame`, 
-however, this functionality can be achieved by providing a `path` option 
to the `DataFrameWriter` with `path` as the key 
-and location of the external table as its value (String) when saving the 
table with `saveAsTable`. When an External table 
+Currently, `saveAsTable` does not expose an API supporting the creation of 
an "external table" from a `DataFrame`,
+however. This functionality can be achieved by providing a `path` option 
to the `DataFrameWriter` with `path` as the key
--- End diff --

No...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16424: [SPARK-19016][SQL][DOC] Document scalable partiti...

2016-12-28 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/16424

[SPARK-19016][SQL][DOC] Document scalable partition handling

## What changes were proposed in this pull request?

This PR documents the scalable partition handling feature in the body of 
the programming guide.

Before this PR, we only mention it in the migration guide. It's not super 
clear that external datasource tables require an extra `MSCK REPAIR TABLE` 
command to have per-partition information persisted since 2.1.

## How was this patch tested?

N/A.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark 
scalable-partition-handling-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16424.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16424


commit 2e47d7da9251bd4d2dca6071db96f8b8a3dc6f1e
Author: Cheng Lian <l...@databricks.com>
Date:   2016-12-28T19:14:18Z

Document scalable partition handling




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16184: [SPARK-18753][SQL] Keep pushed-down null literal as a fi...

2016-12-14 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16184
  
LGTM, thanks for working on this!

I think the problem is that we generate a foldable predicate (`lit(null)`) 
during the optimization phase but fail to fold it. Ideally, the optimizer 
should fold `IsNotNull($"_1") && lit(null)` into `lit(false)`. But it's still 
good to check for foldable predicates at the planning phase.
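
For reference, a minimal sketch (hypothetical column names and session setup, 
not the original test case) of the kind of predicate this is about — a 
foldable null literal combined with a pushable filter:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .master("local[*]").appName("foldable-null-filter").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("_1", "_2")

// `lit(null)` is foldable; ideally the optimizer folds `IsNotNull(_1) && null`
// into `false`, so no foldable predicate ever reaches the data source.
df.filter($"_1".isNotNull && lit(null).cast("boolean")).explain(true)
```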

Merging this to master and branch-2.1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...

2016-12-14 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16030
  
Another thing that @cloud-fan and I agreed upon is that our current 
`DataFrameReader.schema()` interface method is insufficient. We may want to add

- `DataFrameReader.dataSchema()`, and
- `DataFrameReader.partitionSchema()`

in the future to capture cases where data schema and partition schema 
overlap with each other.

The reason is that `FileFormat.buildReader()` requires a `dataSchema` 
argument, which should be the full schema of the underlying data files and may 
contain one or more partition columns. Many data sources (e.g. CSV and LibSVM) 
simply don't support column pruning, so you have to read out all physical 
columns first and then project out the unnecessary ones.

Say we have a table with the following schema information:

- Data schema: `[x, y, p1]`
- Partition schema: `[p1, p2]`

Say a user tries to read this table with a user-specified schema `[x, y, p1, p2]`. 
Now we have no idea whether `p1` is part of the data schema or not, since we 
skip schema inference when a user-specified schema is provided.

Although I haven't tested it extensively, I'd say it would be problematic if we 
try to use a user-specified schema together with data sources without column 
pruning support (e.g. CSV) when the data schema overlaps with the partition 
schema.
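
To make the ambiguity concrete, here is a small sketch (hypothetical paths, 
types, and values, not taken from any PR) where `p1` lives both in the data 
files and in the directory layout:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[*]").appName("schema-overlap").getOrCreate()
import spark.implicits._

val path = "/tmp/overlap-table"

// The data files themselves contain x, y, and p1...
Seq((1, "a", 10)).toDF("x", "y", "p1")
  .write.mode("overwrite").parquet(s"$path/p1=10/p2=foo")

// ...while p1 and p2 are also partition columns in the directory layout.
// With a user-specified schema [x, y, p1, p2] and schema inference skipped,
// we cannot tell whether p1 should come from the files or from the paths.
val userSchema = StructType(Seq(
  StructField("x", IntegerType),
  StructField("y", StringType),
  StructField("p1", IntegerType),
  StructField("p2", StringType)))

spark.read.schema(userSchema).parquet(path).show()
```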



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...

2016-12-13 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16030
  
@maropu @brkyvz Sorry for the delay, I was blocked by some other tasks 
during the last few days.

@cloud-fan and I just did some investigation and we think we came up with a 
minimal fix to this issue that also preserves the old behavior. Although we do 
agree that the new behavior makes more sense (moving all partition columns to 
the end of the output schema), it is a breaking change and we'd like to 
minimize risks for the 2.1 release.

@cloud-fan will give a summary of the new fix and reasons behind it soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16163: [SPARK-18730] Post Jenkins test report page instead of t...

2016-12-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16163
  
@srowen Thanks. I sent this one because the `consoleFull` page frequently 
freezes my browser these days, not to mention viewing Jenkins build results on 
a mobile phone...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16163: [SPARK-18730] Post Jenkins test report page inste...

2016-12-05 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/16163

[SPARK-18730] Post Jenkins test report page instead of the full console 
output page to GitHub

## What changes were proposed in this pull request?

Currently, the full console output page of a Spark Jenkins PR build can be 
as large as several megabytes. It takes a relatively long time to load and may 
even freeze the browser for quite a while.

This PR makes the build script post the test report page link to GitHub 
instead. The test report page is way more concise and is usually the first page 
I'd like to check when investigating a Jenkins build failure.

## How was this patch tested?

N/A.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark jenkins-test-report

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16163.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16163


commit 6aa9f34fa2abd02ae07dea5c0a404d67f7ae5998
Author: Cheng Lian <l...@databricks.com>
Date:   2016-12-06T00:44:54Z

Post Jenkins test report page instead of the full console output page




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...

2016-12-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16030
  
I'm checking whether the original behavior is consistent (do we always 
respect the data schema column order when partition columns are included in the 
data schema?). If not, I'd call it a bug and we just fix it. If yes, we may 
want to come up with a fix for this PR that preserves the old behavior and adds 
a feature flag to deprecate it and enable the preferable behavior.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...

2016-12-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16030
  
@brkyvz I agree that always moving all partitioned columns to the end of 
the schema is more consistent and intuitive. However, users may have 
ordinal-dependent code like this:

```scala
df.rdd.map { case Row(name: String, age: Int) =>
  ...
}
```

If we silently change the column order, these applications may break.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16156: [SPARK-18539][SQL]: tolerate pushed-down filter on non-e...

2016-12-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16156
  
Hey @xwu0226 @gatorsmile, did some investigation, and I don't think this is 
a bug now. Please refer to [my JIRA comment][1] for more details.

[1]: 
https://issues.apache.org/jira/browse/SPARK-18539?focusedCommentId=15723747=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15723747


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16156: [SPARK-18539][SQL]: tolerate pushed-down filter on non-e...

2016-12-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16156
  
@xwu0226 Just tested that this issue also affects the normal Parquet reader 
(by setting `spark.sql.parquet.enableVectorizedReader` to `false`). That's also 
why #9940 couldn't take a similar approach to this one, since 
`ParquetRecordReader` is a 3rd-party class provided by parquet-mr.
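
For reference, a one-liner sketch (assuming a `spark` session is in scope) of 
switching to the normal, non-vectorized Parquet reader mentioned above:

```scala
// Route Parquet reads through the normal parquet-mr code path (ParquetRecordReader).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
```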


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...

2016-12-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16156#discussion_r90973620
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
 ---
@@ -107,7 +107,16 @@ public void initialize(InputSplit inputSplit, 
TaskAttemptContext taskAttemptCont
   footer = readFooter(configuration, file, range(split.getStart(), 
split.getEnd()));
   MessageType fileSchema = footer.getFileMetaData().getSchema();
   FilterCompat.Filter filter = getFilter(configuration);
-  blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
+  try {
+blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
+  } catch (IllegalArgumentException e) {
+// In the case where a particular parquet files does not contain
+// the column(s) in the filter, we don't do filtering at this level
+// PARQUET-389 will resolve this issue in Parquet 1.9, which may 
be used
+// by future Spark versions. This is a workaround for current 
Spark version.
+// Also the assumption here is that the predicates will be applied 
later
+blocks = footer.getBlocks();
--- End diff --

Thanks. We may just mention [SPARK-18140][1] here. It's for upgrading 
parquet-mr to 1.9.

[1]: https://issues.apache.org/jira/browse/SPARK-18140


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...

2016-12-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16156#discussion_r90972713
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -578,4 +578,66 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
   // scalastyle:on nonascii
 }
   }
+
+  test("SPARK-18539 - filtered by non-existing parquet column") {
+Seq("parquet").foreach { format =>
+  withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+withTempPath { path =>
+  import testImplicits._
+  Seq((1, "abc"), (2, "hello")).toDF("a", 
"b").write.format(format).save(path.toString)
+
+  // user-specified schema contains nonexistent columns
+  val schema = StructType(
+Seq(StructField("a", IntegerType),
+  StructField("b", StringType),
+  StructField("c", IntegerType)))
+  val readDf1 = 
spark.read.schema(schema).format(format).load(path.toString)
+
+  // Read the table without any filter
+  checkAnswer(readDf1, Row(1, "abc", null) :: Row(2, "hello", 
null) :: Nil)
+  // Read the table with a filter on existing columns
+  checkAnswer(readDf1.filter("a < 2"), Row(1, "abc", null) :: Nil)
+
+  // Read the table with a filter on nonexistent columns
+  checkAnswer(readDf1.filter("c < 2"), Nil)
+  checkAnswer(readDf1.filter("c is not null"), Nil)
+
+  checkAnswer(readDf1.filter("c is null"),
+Row(1, "abc", null) :: Row(2, "hello", null) :: Nil)
+
+  checkAnswer(readDf1.filter("c is null and a < 2"),
+Row(1, "abc", null) :: Nil)
+
+  // With another parquet file that contains the column "c"
+  Seq((2, "abc", 3), (3, "hello", 4)).toDF("a", "b", "c")
+.write.format(format).mode(SaveMode.Append).save(path.toString)
+
+  // Right now, there are 2 parquet files. one with the column "c",
+  // the other does not have it.
+  val readDf2 = 
spark.read.schema(schema).format(format).load(path.toString)
--- End diff --

Yea, what I'm concerned about is that we are using the vectorized Parquet 
reader by default now and may not have enough test coverage of the normal 
Parquet reader. It would be good to have test cases to prevent regressions in 
the normal Parquet reader code path.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...

2016-12-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16156#discussion_r90972121
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
 ---
@@ -107,7 +107,16 @@ public void initialize(InputSplit inputSplit, 
TaskAttemptContext taskAttemptCont
   footer = readFooter(configuration, file, range(split.getStart(), 
split.getEnd()));
   MessageType fileSchema = footer.getFileMetaData().getSchema();
   FilterCompat.Filter filter = getFilter(configuration);
-  blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
+  try {
+blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
+  } catch (IllegalArgumentException e) {
+// In the case where a particular parquet files does not contain
+// the column(s) in the filter, we don't do filtering at this level
+// PARQUET-389 will resolve this issue in Parquet 1.9, which may 
be used
+// by future Spark versions. This is a workaround for current 
Spark version.
+// Also the assumption here is that the predicates will be applied 
later
--- End diff --

No matter whether filters are pushed down to the Parquet reader or not, Spark 
will always apply all the filters again at a higher level to ensure that they 
are applied as expected. Spark treats data source filter push-down in a "best 
effort" manner.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...

2016-12-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16030
  
@brkyvz I also worry about the behavior change. Let me check whether the 
original behavior is by design or by accident. If it is a bug from the very 
beginning, then we should just fix it in this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16156: [SPARK-18539][SQL]: tolerate pushed-down filter on non-e...

2016-12-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16156
  
BTW, I think this PR is a cleaner fix than #9940, which introduces temporary 
metadata while merging two `StructType`s and erases it in a later phase. We may 
want to remove the hack done in #9940 later if possible. But for now, let's 
make the fix as surgical as possible to lower the risk for the 2.1 release.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16156: [SPARK-18539][SQL]: tolerate pushed-down filter on non-e...

2016-12-05 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16156
  
Actually, PR #9940 should have already fixed this issue. I'm checking why 
it doesn't work under 2.0.1 for 2.0.2.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...

2016-12-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16156#discussion_r90969603
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -578,4 +578,66 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
   // scalastyle:on nonascii
 }
   }
+
+  test("SPARK-18539 - filtered by non-existing parquet column") {
+Seq("parquet").foreach { format =>
+  withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+withTempPath { path =>
+  import testImplicits._
+  Seq((1, "abc"), (2, "hello")).toDF("a", 
"b").write.format(format).save(path.toString)
+
+  // user-specified schema contains nonexistent columns
+  val schema = StructType(
+Seq(StructField("a", IntegerType),
+  StructField("b", StringType),
+  StructField("c", IntegerType)))
+  val readDf1 = 
spark.read.schema(schema).format(format).load(path.toString)
+
+  // Read the table without any filter
+  checkAnswer(readDf1, Row(1, "abc", null) :: Row(2, "hello", 
null) :: Nil)
+  // Read the table with a filter on existing columns
+  checkAnswer(readDf1.filter("a < 2"), Row(1, "abc", null) :: Nil)
+
+  // Read the table with a filter on nonexistent columns
+  checkAnswer(readDf1.filter("c < 2"), Nil)
+  checkAnswer(readDf1.filter("c is not null"), Nil)
+
+  checkAnswer(readDf1.filter("c is null"),
+Row(1, "abc", null) :: Row(2, "hello", null) :: Nil)
+
+  checkAnswer(readDf1.filter("c is null and a < 2"),
+Row(1, "abc", null) :: Nil)
+
+  // With another parquet file that contains the column "c"
+  Seq((2, "abc", 3), (3, "hello", 4)).toDF("a", "b", "c")
+.write.format(format).mode(SaveMode.Append).save(path.toString)
+
+  // Right now, there are 2 parquet files. one with the column "c",
+  // the other does not have it.
+  val readDf2 = 
spark.read.schema(schema).format(format).load(path.toString)
--- End diff --

In fact, we probably also want to test both the normal Parquet reader and 
the vectorized Parquet reader, using a similar trick by setting 
`SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key`.
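
Something along these lines (a sketch only, reusing the `withSQLConf` helper 
this suite already uses):

```scala
// Run the same assertions once per Parquet reader implementation.
Seq(true, false).foreach { vectorized =>
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
    // ... the existing reads and checkAnswer assertions go here ...
  }
}
```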


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...

2016-12-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16156#discussion_r90968974
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -578,4 +578,66 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
   // scalastyle:on nonascii
 }
   }
+
+  test("SPARK-18539 - filtered by non-existing parquet column") {
+Seq("parquet").foreach { format =>
+  withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+withTempPath { path =>
+  import testImplicits._
+  Seq((1, "abc"), (2, "hello")).toDF("a", 
"b").write.format(format).save(path.toString)
+
+  // user-specified schema contains nonexistent columns
+  val schema = StructType(
+Seq(StructField("a", IntegerType),
+  StructField("b", StringType),
+  StructField("c", IntegerType)))
+  val readDf1 = 
spark.read.schema(schema).format(format).load(path.toString)
+
+  // Read the table without any filter
+  checkAnswer(readDf1, Row(1, "abc", null) :: Row(2, "hello", 
null) :: Nil)
+  // Read the table with a filter on existing columns
+  checkAnswer(readDf1.filter("a < 2"), Row(1, "abc", null) :: Nil)
+
+  // Read the table with a filter on nonexistent columns
+  checkAnswer(readDf1.filter("c < 2"), Nil)
+  checkAnswer(readDf1.filter("c is not null"), Nil)
+
+  checkAnswer(readDf1.filter("c is null"),
+Row(1, "abc", null) :: Row(2, "hello", null) :: Nil)
+
+  checkAnswer(readDf1.filter("c is null and a < 2"),
+Row(1, "abc", null) :: Nil)
+
+  // With another parquet file that contains the column "c"
+  Seq((2, "abc", 3), (3, "hello", 4)).toDF("a", "b", "c")
+.write.format(format).mode(SaveMode.Append).save(path.toString)
+
+  // Right now, there are 2 parquet files. one with the column "c",
+  // the other does not have it.
+  val readDf2 = 
spark.read.schema(schema).format(format).load(path.toString)
--- End diff --

By default, we don't enable Parquet schema merging. Could you please 
reorganize this test case into something like this to have both code paths 
tested?

```scala
Seq(true, false).foreach { merge =>
  test("SPARK-18539 - filtered by non-existing parquet column - schema 
merging enabled: $merge") {
// ...
val readDf2 = spark.read
  .option("mergeSchema", merge.toString)
  .schema(schema)
  .format(format)
  .load(path.toString)
// ...
  }
}
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16156: [SPARK-18539][SQL]: tolerate pushed-down filter o...

2016-12-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/16156#discussion_r90968113
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
 ---
@@ -107,7 +107,16 @@ public void initialize(InputSplit inputSplit, 
TaskAttemptContext taskAttemptCont
   footer = readFooter(configuration, file, range(split.getStart(), 
split.getEnd()));
   MessageType fileSchema = footer.getFileMetaData().getSchema();
   FilterCompat.Filter filter = getFilter(configuration);
-  blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
+  try {
+blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
+  } catch (IllegalArgumentException e) {
+// In the case where a particular parquet files does not contain
+// the column(s) in the filter, we don't do filtering at this level
+// PARQUET-389 will resolve this issue in Parquet 1.9, which may 
be used
+// by future Spark versions. This is a workaround for current 
Spark version.
+// Also the assumption here is that the predicates will be applied 
later
+blocks = footer.getBlocks();
--- End diff --

Could you please add a TODO item in the comment so that we won't forget to 
remove this after upgrading to 1.9? Thanks!

```
// TODO Remove this hack after upgrading to parquet-mr 1.9.
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16106: [SPARK-17213][SQL] Disable Parquet filter push-down for ...

2016-12-01 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16106
  
The last build failure doesn't seem to be relevant.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16106: [SPARK-17213][SQL] Disable Parquet filter push-down for ...

2016-12-01 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16106
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16106: [SPARK-17213][SQL] Disable Parquet filter push-do...

2016-12-01 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/16106

[SPARK-17213][SQL] Disable Parquet filter push-down for string and binary 
columns due to PARQUET-686

## What changes were proposed in this pull request?

Due to PARQUET-686, Parquet doesn't do string comparison correctly while 
doing filter push-down for string columns. This PR disables filter push-down 
for both string and binary columns to work around this issue. Binary columns 
are also affected because some Parquet data models (like Hive) may store string 
columns as a plain Parquet `binary` instead of a `binary (UTF8)`.
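
A rough sketch (hypothetical data and path, `spark` and its implicits assumed 
in scope; the exact outcome depends on how row group statistics are written) of 
the class of queries that can go wrong — string predicates over values with 
bytes above 0x7F, since the affected parquet-mr versions compare binary min/max 
statistics as signed bytes:

```scala
val path = "/tmp/spark-17213-demo"
Seq("a", "é", "z").toDF("s").write.mode("overwrite").parquet(path)

// With push-down enabled, row groups may be pruned using signed byte-array
// comparison of min/max statistics, which mis-orders "é" (0xC3 0xA9 in UTF-8).
spark.read.parquet(path).filter($"s" > "a").show()
```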

## How was this patch tested?

New test case added in `ParquetFilterSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark spark-17213-bad-string-ppd

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16106.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16106


commit d7162bfdb2a4389ea08cbb0d3627a429de6d7c45
Author: Cheng Lian <l...@databricks.com>
Date:   2016-12-01T19:52:20Z

Add test case

commit 810c1533ae9ba959e9696a806b22e19f8d0b5c31
Author: Cheng Lian <l...@databricks.com>
Date:   2016-12-01T19:54:09Z

Disable Parquet filter push-down for string and binary columns

commit ce71cca44009d2f8508e02291ac6872b680cd28a
Author: Cheng Lian <l...@databricks.com>
Date:   2016-12-01T19:59:37Z

Test more string comparison operators




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...

2016-11-30 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16030
  
My hunch is that we somehow passed a wrong requested schema containing the 
partition column down to the vectorized Parquet reader. IIRC, we prune 
partition columns from the data schema when generating the requested schema for 
the underlying reader; since partition values are directly available in the 
directory path, there's no need to read and decode them from the physical file.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...

2016-11-30 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16030
  
@maropu I tried your snippet (with minor modifications) on 1.6.0 instead of 
2.0.2; it works there:

```scala
case class A(a: Long, b: Int)
val as = Seq(A(1, 2))
val path = "/tmp/part"

sqlContext.createDataFrame(as).write.mode("overwrite").parquet(s"$path/a=1/")
val df = sqlContext.read.parquet(path)
df.printSchema()
df.collect()
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-11-30 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/15979
  
Also backported to branch-2.1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-11-30 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/15979
  
@rxin Shall we backport this to branch-2.1? I think it's relatively safe.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-11-30 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/15979
  
Merging to master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16030: [SPARK-18108][SQL] Fix a bug to fail partition schema in...

2016-11-30 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16030
  
@brkyvz @maropu Actually, we do allow users to create partitioned tables 
whose data schema contains (part of) the partition columns, and there 
are [test][1] [cases][2] for this use case.

This use case is mostly useful when you are trying to reorganize an 
existing dataset into a partitioned form. Say you have a JSON dataset 
containing all the tweets in 2016 and you'd like to partition it by date. By 
allowing the data schema to contain partition columns, you may simply put 
JSON files of the same date into the same directory. Otherwise, you'll have to 
run an ETL job to erase the date column from the dataset, which can be 
time-consuming.
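
A sketch (hypothetical paths and schema, `spark` assumed in scope) of the 
layout described above, where the date column stays inside the JSON records and 
also appears as a partition directory:

```scala
import org.apache.spark.sql.types._

// Hypothetical directory layout:
//   /data/tweets/date=2016-01-01/part-00000.json   (records still contain a "date" field)
//   /data/tweets/date=2016-01-02/part-00000.json
val tweetSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("text", StringType),
  StructField("date", StringType)))

// A user-specified schema pins down the type of the overlapping "date" column.
val tweets = spark.read.schema(tweetSchema).json("/data/tweets")
```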

As for the query @maropu mentioned in the PR description, the query itself 
is problematic, because it lacks a user-specified schema to override the data 
type of the partitioned column `a`. Ideally, partition discovery should be able 
to fill in the correct data type `LongType`, but it's impossible since the 
directory path doesn't really expose that information. That's why a 
user-specified schema is necessary.

This query works in 2.0.2 because of the bug @brkyvz fixed: Spark ignores 
the data types of partitioned columns specified in the user-specified schema. 
Now the bug is fixed and this query doesn't work as expected.

In short:

1. This isn't a regression, the original query itself is problematic.
2. For this PR, we can either just close it or try to provide a better 
error message in the read path (ask the user to provide a user-specified 
schema) when:

   - A partitioned columns `p` also appears in the data schema, and
   - The discovered data type of `p` is different from the data type 
specified in the data schema.

   An alternative is to override the discovered partition column data type 
using the one in the data schema if any. But I'd say this change is probably 
too risky at this moment for 2.1.

[1]: 
https://github.com/apache/spark/blob/c51c7725944d60738e2bac3e11f6aea74812905c/sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala#L44-L66
[2]: 
https://github.com/apache/spark/blob/c51c7725944d60738e2bac3e11f6aea74812905c/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala#L43-L66


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-11-30 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/15979
  
Good to merge pending Jenkins. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15976: [SPARK-18403][SQL] Fix unsafe data false sharing issue i...

2016-11-29 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/15976
  
@cloud-fan @dongjoon-hyun Thanks for the review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16061: [SPARK-18278] [Scheduler] Support native submission of s...

2016-11-29 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/16061
  
@erikerlandson For the RAT failure, you may either add the Apache license 
header to the newly added files or add them to `dev/.rat-excludes`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-11-29 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/15979
  
My only concern is that "non-flat" type is neither intuitive nor a 
well-known term. In fact, this PR only prevents `Option[T <: Product]` from 
being a top-level Dataset type. How about just calling them "Product" types?
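
For reference, a small sketch (assuming Spark 2.x with `spark.implicits._` in 
scope) of the distinction — only the top-level `Option` of a Product type is 
affected:

```scala
case class Point(x: Int, y: Int)

// Top-level Option of a Product type: rejected, since the entire row would
// have to be nullable.
// Seq(Some(Point(1, 2))).toDS()

// Option of a Product type nested inside a tuple: still supported.
val ds = Seq((1, Some(Point(1, 2)))).toDS()
```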

Otherwise LGTM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15979: [SPARK-18251][SQL] the type of Dataset can't be Option o...

2016-11-29 Thread liancheng
Github user liancheng commented on the issue:

https://github.com/apache/spark/pull/15979
  
LGTM, merging to master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


