[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165379694

@gatorsmile Sorry for the late reply, and thanks for the nice catch!

The `In` predicate push-down issue had been tracked by SPARK-11164 and was done as part of PR #8956. Unfortunately, we didn't merge that PR due to other problems in it. Could you please add SPARK-11164 to your PR title?

For the `Not` push-down rule:

1. I'm for adding it to branch-1.5 since it's a pretty safe one.
2. I think we might also want to add a more general [CNF][1] conversion rule to master, which should be done in a separate PR, of course.

Since we don't have existential / universal quantifiers in our predicates, I think CNF conversion in Spark SQL can be as simple as repeatedly pushing `Not` and `Or` inward (or downward) using De Morgan's laws and the distributive law:

```scala
// Catalyst classes the rule relies on
import org.apache.spark.sql.catalyst.expressions.{And, Not, Or}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object CNFConversion extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case filter: Filter =>
      // Brings `!`, `&&` and `||` on expressions into scope
      import org.apache.spark.sql.catalyst.dsl.expressions._

      filter.copy(condition = filter.condition.transform {
        // De Morgan's laws
        case Not(x Or y) => !x && !y
        case Not(x And y) => !x || !y
        // Distributive law
        case (x And y) Or z => (x || z) && (y || z)
        case x Or (y And z) => (x || y) && (x || z)
      })
  }
}
```

(Notice that this version doesn't handle common expression elimination.)

That said, the `Not` push-down rule is actually a subset of CNF conversion. There was once a PR that aimed to add CNF conversion for data source filter push-down only, but it wasn't merged (see SPARK-6624 and PR #6713). As @marmbrus commented there, CNF conversion might be worth adding to the optimizer.

@rxin @marmbrus I'm not super confident about the CNF conversion conclusion above, so please correct me if I'm wrong.

[1]: https://en.wikipedia.org/wiki/Conjunctive_normal_form

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
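The CNF idea in the comment above can be tried outside Spark. The following is only a minimal sketch on a hypothetical expression tree (`Expr`, `Attr` and the `Cnf` object are made up for illustration and are not Catalyst classes). It applies De Morgan's laws, double-negation removal and the distributive law top-down, and repeats the pass until nothing changes, which roughly mirrors how a rule would be run to a fixed point.

```scala
// Hypothetical mini expression tree for illustration; not Catalyst's Expression classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Not(child: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

object Cnf {
  // One top-down rewriting pass.
  def step(e: Expr): Expr = e match {
    case Not(Not(x))      => step(x)                              // double negation
    case Not(Or(x, y))    => And(step(Not(x)), step(Not(y)))      // De Morgan
    case Not(And(x, y))   => Or(step(Not(x)), step(Not(y)))       // De Morgan
    case Or(And(x, y), z) => And(step(Or(x, z)), step(Or(y, z)))  // distribution
    case Or(x, And(y, z)) => And(step(Or(x, y)), step(Or(x, z)))  // distribution
    case And(x, y)        => And(step(x), step(y))
    case Or(x, y)         => Or(step(x), step(y))
    case other            => other                                // Attr or Not(Attr)
  }

  // Repeat the pass until the expression stops changing.
  @annotation.tailrec
  def toCnf(e: Expr): Expr = {
    val next = step(e)
    if (next == e) e else toCnf(next)
  }
}
```

For instance, `Cnf.toCnf(Not(And(Attr("a"), Attr("b"))))` rewrites the negated conjunction into a disjunction of two negated attributes. Like the rule quoted above, this sketch does no common expression elimination, and distribution can grow the expression considerably.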
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165320679 Yeah, you can say that. For example, the original filter is ```not (a = 2 and b in ('1', '2'))```. However, Spark 1.5.2 only pushes down ```not (a = 2)```. Thus, the data returned from Parquet is incomplete, and data loss happens.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165334450 https://github.com/apache/spark/pull/10344 shows that the test fails with out 1.5.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165339904 @gatorsmile @liancheng Looks like we only push a part of the predicate down if we do not understand the other parts. Are there any other kinds of combinations that can trigger this issue?
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165340920 @yhuai Based on my understanding, if we include the fix for `IN` in this PR, we have covered all the filters. The only exceptions are the ones explained in https://issues.apache.org/jira/browse/SPARK-11153. Since 1.6 already has the fix (https://github.com/apache/spark/pull/5700) that pushes the `Not` operator to the innermost level, we can say 1.6 is not affected by the bug even if some filters are not pushed down. Please correct me if anything is inaccurate, @liancheng. Thank you!
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165297020 Yeah, it works without https://github.com/apache/spark/pull/5700. However, I still hope we can backport https://github.com/apache/spark/pull/5700. Without it, these filters will not be pushed down to Parquet, which will have a negative performance impact. If you need it, I can also create another JIRA for backporting https://github.com/apache/spark/pull/5700. Please let me know your opinions. Thanks!
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165298276 Sure, will do it tonight. Thanks!
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165301819 @gatorsmile So, the problem is that Spark SQL generates a wrong Parquet filter?
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165296448 @gatorsmile How about we also create a JIRA against 1.5, so that we can use it to test the fix? (Later, when we merge the PR, we can merge this one if there is no conflict. Otherwise, we merge that one to 1.5 and this one to 1.6 and master.) Also, do we need to backport #5700 to 1.5? Your fix also works without it, right?
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-165297940 @gatorsmile Can you create a PR for 1.5? We can do it this way: the first commit just adds your test case, so our Jenkins build should fail. Then we add your fix, and Jenkins should pass.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/10278#discussion_r47877766

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -265,7 +268,10 @@ private[sql] object ParquetFilters {
               rhsFilter <- createFilter(schema, rhs)
             } yield FilterApi.or(lhsFilter, rhsFilter)

    -      case sources.Not(pred) =>
    +      // Here, we assume the Optimizer's rule BooleanSimplification has pushed `Not` operator
    +      // to the inner most level.
    +      case sources.Not(pred)
    +        if !pred.isInstanceOf[sources.And] && !pred.isInstanceOf[sources.Or] =>
    --- End diff --

Nit: The following version might be clearer:

```scala
// (Copy your comment here)
case sources.Not(_: sources.And) | sources.Not(_: sources.Or) =>
  None

case sources.Not(pred) =>
  createFilter(schema, pred).map(FilterApi.not)
```
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164604885 @liancheng can you look at this? Seems pretty serious if we are returning wrong answers. /cc @yhuai
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164265237 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47623/ Test PASSed.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164265233 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164264969 **[Test build #47623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47623/consoleFull)** for PR 10278 at commit [`50733c6`](https://github.com/apache/spark/commit/50733c6239b721ecb1f0691bb3d4680235c15a18). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164255495 **[Test build #47623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47623/consoleFull)** for PR 10278 at commit [`50733c6`](https://github.com/apache/spark/commit/50733c6239b721ecb1f0691bb3d4680235c15a18).
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164232814 After reading the source code, I think it does not make sense that we do not push down `IN` to Parquet in the above example: ```"not (a = 2 and b in ('1', '2'))"```. We should fix these two issues in both 1.5.x and 1.6.x.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/10278

[SPARK-12218] [SQL] Fixed the Parquet's filter generation rule when `Not` is included in Parquet filter pushdown

When applying the operator `Not`, the current generation rule for Parquet filters simply applies `Not` to whatever underlying filters were generated. For example, when the filter is ```"not (a = 2 and b in ('1', '2'))"```, the generated filter is ```not (a=2)```. When we push this filter down to Parquet, it wrongly removes the eligible rows where `a = 2` and the condition ```not(b in ('1', '2'))``` holds.

In the current 1.6, the Optimizer's rule BooleanSimplification added the following new rules in the PR https://github.com/apache/spark/pull/5700 (BTW, should we move this to the analyzer?):

```
not(A and B) => not(A) or not(B)
not(A or B) => not(A) and not(B)
```

I do not think we should redo it in the Parquet filter generation. Thus, I just added a condition to avoid incorrect results in case the Optimizer is unable to handle all the cases.

**Question**: how can we include the PR https://github.com/apache/spark/pull/5700 into 1.5? Do you need me to submit a new PR for 1.5, or can you do it? This is a critical PR because the result will be incorrect without the fix.

CC the original reviewers of https://github.com/apache/spark/pull/5700: @marmbrus @cloud-fan

Thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark parquetFilterNot

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10278.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10278

commit 79be2c3581551ab24273f3da472269814d0d736e
Author: gatorsmile
Date: 2015-12-12T18:10:16Z

    added a condition for `Not` operator in ParquetFilter.
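The failure described in the pull request can be reproduced with a small model. This is a hedged sketch only: the `Filter` classes and `PushDownModel` below are hypothetical stand-ins (not Spark's `sources.Filter` or Parquet's `FilterPredicate`), `In` is assumed to be non-convertible as in the example, and the conjunction rule is assumed to keep whichever side converts, as the thread suggests. With `guardNot = false` the model negates a partially converted conjunction; with `guardNot = true` it refuses to push `Not` over `And`/`Or`, which is the condition this PR adds.

```scala
// Hypothetical stand-ins for illustration; not Spark's or Parquet's real classes.
sealed trait Filter
case class EqualTo(attr: String, value: Any) extends Filter
case class In(attr: String, values: Seq[Any]) extends Filter
case class And(left: Filter, right: Filter) extends Filter
case class Or(left: Filter, right: Filter) extends Filter
case class Not(child: Filter) extends Filter

object PushDownModel {
  // Renders the predicate that would be pushed down, or None if nothing can be pushed.
  def createFilter(f: Filter, guardNot: Boolean): Option[String] = f match {
    case EqualTo(a, v) => Some(s"$a = $v")
    case In(_, _)      => None // assumed not convertible, as in the example
    case And(l, r) =>
      // Keeps whichever sides convert; dropping a conjunct only lets extra rows through.
      (createFilter(l, guardNot).toList ++ createFilter(r, guardNot).toList)
        .reduceOption((x, y) => s"($x and $y)")
    case Or(l, r) =>
      // Both sides must convert, otherwise rows could be lost.
      for {
        x <- createFilter(l, guardNot)
        y <- createFilter(r, guardNot)
      } yield s"($x or $y)"
    // The guard: never negate a compound filter that may have been partially converted.
    case Not(_: And) | Not(_: Or) if guardNot => None
    case Not(p) => createFilter(p, guardNot).map(x => s"not ($x)")
  }
}
```

Calling `PushDownModel.createFilter(Not(And(EqualTo("a", 2), In("b", Seq("1", "2")))), guardNot = false)` yields the too-strict `not (a = 2)` predicate from the description, while the guarded variant pushes nothing down for the same input. Pushing nothing costs performance but cannot drop rows.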
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164190262 **[Test build #47616 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47616/consoleFull)** for PR 10278 at commit [`2ff70bf`](https://github.com/apache/spark/commit/2ff70bfac2c9be9e75cc7840dd3844854f565325).
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164175763 **[Test build #47615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47615/consoleFull)** for PR 10278 at commit [`79be2c3`](https://github.com/apache/spark/commit/79be2c3581551ab24273f3da472269814d0d736e).
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164178387 After reading the other push-down PR, I think this one also needs a review from @liancheng. Any comments are welcome! Thanks!
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164188245 **[Test build #47615 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47615/consoleFull)** for PR 10278 at commit [`79be2c3`](https://github.com/apache/spark/commit/79be2c3581551ab24273f3da472269814d0d736e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164188283 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47615/ Test PASSed.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164188282 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164209122 **[Test build #47618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47618/consoleFull)** for PR 10278 at commit [`c9af771`](https://github.com/apache/spark/commit/c9af771adb998b54c8bfcbdf64ac4fc1b82d14ad).
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164213638 **[Test build #47618 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47618/consoleFull)** for PR 10278 at commit [`c9af771`](https://github.com/apache/spark/commit/c9af771adb998b54c8bfcbdf64ac4fc1b82d14ad). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164213704 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47618/ Test PASSed.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164213703 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164203719 It's fine if the test only fails on 1.5.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164204488

Great! : ) Let me also post the test case I ran on the latest 1.5. Without my fix, the first call of show() did not return the row (2, 0).

```scala
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  withTempPath { dir =>
    val path = s"${dir.getCanonicalPath}/table1"
    (1 to 5).map(i => (i, (i % 2).toString)).toDF("a", "b").write.parquet(path)

    // Negated conjunction: without the fix, the row (2, 0) is missing here
    val df = sqlContext.read.parquet(path).where("not (a = 2 and b in ('1'))")
    df.show()

    // Same predicate with De Morgan's law applied by hand
    val df1 = sqlContext.read.parquet(path).where("not (a = 2) or not(b in ('1'))")
    df1.show()
  }
}
```
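As a cross-check of what the two queries in the test case should return, the same predicates can be evaluated on plain Scala collections. This is only an illustrative sketch with made-up names (`ExpectedRows` and its members are not part of the PR), and it ignores SQL's three-valued null logic, which does not matter here because the test data contains no nulls.

```scala
object ExpectedRows {
  // Same data as the test case: (1,"1"), (2,"0"), (3,"1"), (4,"0"), (5,"1")
  val rows: Seq[(Int, String)] = (1 to 5).map(i => (i, (i % 2).toString))

  // not (a = 2 and b in ('1'))
  def original(a: Int, b: String): Boolean = !(a == 2 && Set("1").contains(b))

  // not (a = 2) or not (b in ('1'))
  def deMorgan(a: Int, b: String): Boolean = a != 2 || !Set("1").contains(b)

  val expected: Seq[(Int, String)] = rows.filter { case (a, b) => original(a, b) }
  val rewritten: Seq[(Int, String)] = rows.filter { case (a, b) => deMorgan(a, b) }

  // Models the bug: the scan first applies the wrongly pushed `not (a = 2)`,
  // then the full predicate is re-applied to whatever rows are left.
  val afterWrongPush: Seq[(Int, String)] =
    rows.filter { case (a, _) => a != 2 }.filter { case (a, b) => original(a, b) }
}
```

Both `expected` and `rewritten` keep every row, including `(2, "0")`, whereas `afterWrongPush` has already lost that row before the full predicate is evaluated. That matches the symptom reported for the first `show()` call.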
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164204557 I might have found another bug in Parquet pushdown. I will submit another PR later when I can confirm it.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164202075 Do you have a test case that actually shows a wrong answer being computed?
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164202142 This only happens in 1.5. Do you need me to write a test case for 1.5?
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164202611 Any bug fix should have a regression test. We could always change the optimizer in a way that does not hide this bug anymore.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164202727 OK, I will try to force it. Thanks!
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164198466 **[Test build #47616 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47616/consoleFull)** for PR 10278 at commit [`2ff70bf`](https://github.com/apache/spark/commit/2ff70bfac2c9be9e75cc7840dd3844854f565325). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164198540 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47616/ Test PASSed.
[GitHub] spark pull request: [SPARK-12218] [SQL] Fixed the Parquet's filter...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10278#issuecomment-164198539 Merged build finished. Test PASSed.