Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165379694
@gatorsmile Sorry for the late reply and thanks for the nice catch!
The `In` predicate push-down issue is tracked by SPARK-11164, and
was done as part of
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165320679
Yeah, you can say that.
For example, the original filter is ```not (a = 2 and b in ('1', '2'))```.
However, Spark 1.5.2 only pushes down ```not (a = 2)```, which is not equivalent: it wrongly drops rows where a = 2 but b is not in ('1', '2').
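To make the wrong answer concrete, here is a small plain-Python sketch (not Spark code; the sample rows are made up for illustration) that evaluates the full predicate and the partially pushed-down one side by side:

```python
# Illustration (plain Python, not Spark): why pushing down only
# `not (a = 2)` from `not (a = 2 and b in ('1', '2'))` is unsound.

rows = [(1, '1'), (2, '0'), (2, '1'), (3, '5')]  # hypothetical (a, b) rows

def full_predicate(a, b):
    # The filter the user actually wrote.
    return not (a == 2 and b in ('1', '2'))

def partial_pushdown(a, b):
    # What the buggy rule pushes to Parquet: only `not (a = 2)`.
    return not (a == 2)

kept_correct = [r for r in rows if full_predicate(*r)]
kept_buggy = [r for r in rows if partial_pushdown(*r)]

# Row (2, '0') satisfies the full predicate but is dropped by the
# partial filter, so Parquet never returns it: a wrong answer.
print(kept_correct)  # [(1, '1'), (2, '0'), (3, '5')]
print(kept_buggy)    # [(1, '1'), (3, '5')]
```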
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165334450
https://github.com/apache/spark/pull/10344 shows that the test fails without
the fix in 1.5.
---
If your project is set up for it, you can reply to this email and have your
reply
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165339904
@gatorsmile @liancheng Looks like we only push part of the predicate down
if we do not understand the other parts. Are there any other kinds of combinations
that can trigger
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165340920
@yhuai Based on my understanding, once the fix for `In` is included in this PR,
we have covered all the filters. The only exceptions are the ones explained in
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165297020
Yeah, it works without https://github.com/apache/spark/pull/5700.
However, I still hope we can backport
https://github.com/apache/spark/pull/5700. Without
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165298276
Sure, will do it tonight. Thanks!
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165301819
@gatorsmile So, the problem is that Spark SQL generates a wrong Parquet filter?
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165296448
@gatorsmile How about we also create a JIRA against 1.5? Then we can use
that to test the fix (later, when we merge the PR, we can merge this one if there is
no conflict).
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-165297940
@gatorsmile Can you create a PR for 1.5? We can do this: the first commit
just has your test case, so our Jenkins should fail. Then we add
your fix and
Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/10278#discussion_r47877766
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
---
@@ -265,7 +268,10 @@ private[sql] object
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164604885
@liancheng can you look at this? Seems pretty serious if we are returning
wrong answers.
/cc @yhuai
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164265237
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164265233
Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164264969
**[Test build #47623 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47623/consoleFull)**
for PR 10278 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164255495
**[Test build #47623 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47623/consoleFull)**
for PR 10278 at commit
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164232814
After reading the source code, it does not make sense that we do not push down
`In` to Parquet in the above example:
```"not (a = 2 and b in ('1', '2'))"```.
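Assuming both sides of the conjunction are convertible, the whole filter can be rewritten via De Morgan's law so that the `In` part is pushed down too. A quick plain-Python equivalence check (illustration only, not the Spark rule itself):

```python
# De Morgan rewrite: `not (a = 2 and b in ('1', '2'))` is equivalent to
# `not (a = 2) or not (b in ('1', '2'))`, whose two disjuncts can each
# be pushed down on their own.
from itertools import product

def original(a, b):
    return not (a == 2 and b in ('1', '2'))

def demorgan(a, b):
    return not (a == 2) or not (b in ('1', '2'))

# Exhaustive check over a small domain: the two forms agree everywhere.
assert all(original(a, b) == demorgan(a, b)
           for a, b in product(range(5), ['0', '1', '2', '3']))
print("equivalent on all tested rows")
```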
GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/10278
[SPARK-12218] [SQL] Fixed Parquet's filter generation rule when `Not`
is included in Parquet filter pushdown
When applying the operator `Not`, the current generation rule for Parquet
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164190262
**[Test build #47616 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47616/consoleFull)**
for PR 10278 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164175763
**[Test build #47615 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47615/consoleFull)**
for PR 10278 at commit
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164178387
After reading the other push-down PR, I think it also needs a review from
@liancheng . Welcome any comment! Thanks!
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164188245
**[Test build #47615 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47615/consoleFull)**
for PR 10278 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164188283
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164188282
Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164209122
**[Test build #47618 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47618/consoleFull)**
for PR 10278 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164213638
**[Test build #47618 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47618/consoleFull)**
for PR 10278 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164213704
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164213703
Merged build finished. Test PASSed.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164203719
It's fine if the test only fails on 1.5.
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164204488
Great! : )
Let me also post the test case I ran on the latest 1.5. Without my fix, the
first call of `show()` did not return the row (2, 0).
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164204557
I might have found another bug in Parquet pushdown. I will submit another PR
later once I can confirm it.
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164202075
Do you have a test case that actually shows a wrong answer being computed?
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164202142
This only happens in 1.5. Do you need me to write a test case for 1.5?
Github user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164202611
Any bug fix should have a regression test. We could always change the
optimizer in a way that does not hide this bug anymore.
Github user gatorsmile commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164202727
OK, I will try to force it. Thanks!
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164198466
**[Test build #47616 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47616/consoleFull)**
for PR 10278 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164198540
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/10278#issuecomment-164198539
Merged build finished. Test PASSed.