Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
In light of @HyukjinKwon's benchmark it seems like Spark-side filtering is
the right thing to do here, so I think this should be good?
---
If your project is set up for it, you can reply to th
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
Thanks for confirming this. I will work on this.
Github user davies commented on the issue:
https://github.com/apache/spark/pull/14671
@HyukjinKwon That sounds good, thanks!
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
@davies Do you mind if I ask whether it is sensible to run a benchmark and
then submit a PR to disable this (maybe adding an extra option to
enable/disable it, false by default)?
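A minimal sketch of the idea proposed above, assuming a reader-side switch that is off by default. The names `ReaderOptions`, `record_level_filter`, `read_rows`, and `scan` are hypothetical and are not Spark or Parquet APIs. The point it illustrates is that results stay the same with the switch off, because the engine re-applies the predicate to whatever the reader returns.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class ReaderOptions:
    # Hypothetical flag, invented for illustration; disabled by default.
    record_level_filter: bool = False


def read_rows(
    rows: Iterable[int],
    predicate: Callable[[int], bool],
    options: ReaderOptions = ReaderOptions(),
) -> Iterator[int]:
    """Simulated file reader: drops rows itself only when the flag is on."""
    for row in rows:
        if options.record_level_filter and not predicate(row):
            continue
        yield row


def scan(
    rows: Iterable[int],
    predicate: Callable[[int], bool],
    options: ReaderOptions = ReaderOptions(),
) -> List[int]:
    """Simulated engine: always re-applies the predicate after the reader."""
    return [r for r in read_rows(rows, predicate, options) if predicate(r)]
```

With this shape, a benchmark only has to compare `scan` under the two option values, since correctness does not depend on the flag.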
Github user davies commented on the issue:
https://github.com/apache/spark/pull/14671
@andreweduffy Good point, but we still use parquet-mr when there is any
complex type in the schema.
Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
@davies Row-level filtering doesn't occur with the vectorized reader, which
is now enabled by default
Github user davies commented on the issue:
https://github.com/apache/spark/pull/14671
Before disabling the record-level filter in the Parquet reader, I think
pushing more inefficient predicates into the Parquet reader will be even
worse, right?
Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
Cool, pinging @davies @cloud-fan: would either of you be able to look at
this?
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
@andreweduffy Yup, filed here,
https://issues.apache.org/jira/browse/SPARK-17310.
Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
@HyukjinKwon would you like to file a separate ticket for benchmarking?
It's pretty orthogonal to this PR, see rdblue's comment above.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
@ash211 I am happy to do so, but I would like to check whether an
offline benchmark has already been performed and whether we can disable this if the
performance is better. I don't want to duplicate som
Github user ash211 commented on the issue:
https://github.com/apache/spark/pull/14671
@HyukjinKwon do you have time to work on that benchmark over the next week?
Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
cc @davies @cloud-fan for parquet change, seems I got @rdblue's stamp of
approval
Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
@rxin Did you get the chance to take a closer look at this?
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/14671
@andreweduffy's comments about this make sense to me. Improving the filters
that are pushed is a good idea, even if we decide to disable Parquet's
row-by-row filtering.
The option to disable
Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
Yeah benchmarking is definitely a great idea, as it is likely Spark will be
better than Parquet at filtering individual records, but I'm still not quite
understanding why this filter is any dif
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
@andreweduffy @rxin Maybe I can go for the simple benchmark quickly (maybe
within this weekend) and open a PR to disable Parquet row-by-row filtering if
it makes sense and this can be the reason
Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
That is true, but currently all filters are being pushed down to row-by-row
anyway when not using the vectorized reader, so I'm unclear why the IN filter
is special
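The distinction being discussed can be sketched as follows, assuming a simplified model in which each row group carries min/max statistics for one integer column. `RowGroup`, `may_contain_any`, and `scan_in` are hypothetical names for illustration, not parquet-mr or Spark APIs. A pushed-down `IN` predicate can skip whole row groups from statistics alone, while the surviving groups still need a per-row check.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RowGroup:
    # Simplified stand-in for per-row-group column statistics plus data.
    min_value: int
    max_value: int
    rows: List[int]


def may_contain_any(group: RowGroup, values: Iterable[int]) -> bool:
    """Statistics check: can any IN value fall inside [min, max]?"""
    return any(group.min_value <= v <= group.max_value for v in values)


def scan_in(groups: Iterable[RowGroup], values: Iterable[int]) -> List[int]:
    """Skip row groups via statistics, then filter surviving rows one by one."""
    wanted = set(values)
    matched: List[int] = []
    for group in groups:
        if not may_contain_any(group, wanted):
            continue  # whole row group skipped without reading its rows
        # Statistics are only a coarse test, so a per-row check is still needed.
        matched.extend(row for row in group.rows if row in wanted)
    return matched
```

In this model the per-row check could run in either the reader or the engine; the statistics-based skip is the part that only a pushed-down filter makes possible.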
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
I mean, maybe we should disable the row-by-row one in Parquet with a proper
benchmark first before handling `In` here.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
Yea, that is all true. Actually, it would be okay just not to pass the
filter
[here](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apa
Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
Thanks for the comments guys! Had to search through some code, but I think
I understand the current state of things. Correct me if I'm wrong, but it seems
that record-by-record filtering only o
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/14671
Yea unfortunately the row-by-row filtering doesn't make much sense in
Parquet.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
Thanks for cc'ing me! As you might already know, I think it makes sense
to allow filtering row groups, but this will also be applied row by row for the
normal Parquet reader, and this was removed by
[S
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14671
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63877/
Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14671
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14671
**[Test build #63877 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63877/consoleFull)**
for PR 14671 at commit
[`1c9cf7b`](https://github.com/apache/spark/commit/
Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
cc @HyukjinKwon @rdblue for Parquet-related change
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14671
**[Test build #63877 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63877/consoleFull)**
for PR 14671 at commit
[`1c9cf7b`](https://github.com/apache/spark/commit/1
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14671
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63867/
Test PASSed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14671
Merged build finished. Test PASSed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14671
**[Test build #63867 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63867/consoleFull)**
for PR 14671 at commit
[`7679285`](https://github.com/apache/spark/commit/