Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
Could you also change the PR title and add SPARK-16321?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #63163 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63163/consoleFull)**
for PR 13701 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63163/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63161/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #63161 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63161/consoleFull)**
for PR 13701 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #63163 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63163/consoleFull)**
for PR 13701 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #63161 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63161/consoleFull)**
for PR 13701 at commit
Github user davies commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile In order to merge this patch sooner, it's better to include only
the changes needed to fix the regression. We can clean up the dead code later.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
Found more out-of-date comments:
-
https://github.com/apache/spark/blob/7742d9f1584150befeb2f3d76cdbd4ea1f37c914/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala#L98
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@viirya Can you also remove another piece of dead code, `overrideMinSplitSize`?
BTW, it sounds like these removed functions also set multiple internal and
external parameters. Should we still set
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
uh, I see. This is a regression! We did not call `setFilterPredicate` for
the original Parquet reader.
---
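(For context, pushing a predicate down to the non-vectorized reader goes through Parquet's `ParquetInputFormat`. A minimal sketch, with the column name `"a"` and the comparison value purely illustrative:)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.hadoop.ParquetInputFormat

// Build a predicate on a hypothetical integer column "a" and register it on
// the Hadoop configuration; Parquet can then use it both for row-group
// pruning and for row-level filtering during the scan.
val conf = new Configuration()
val pred = FilterApi.lt(FilterApi.intColumn("a"), Integer.valueOf(10))
ParquetInputFormat.setFilterPredicate(conf, pred)
```

If this call is skipped, the scan still returns correct results (Spark re-applies filters), but no row groups are skipped, which is the regression being discussed.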
Github user maver1ck commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile
I added a comment in Jira:
"spark.sql.parquet.filterPushdown has true as its default.
The Vectorized Reader doesn't apply here because I have nested columns (and
the Vectorized Reader
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@maver1ck To use the original reader with filter push down, the best way is
to set `spark.sql.parquet.enableVectorizedReader` to false and
`spark.sql.parquet.filterPushdown` to true. Let us know
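(For reference, the two settings mentioned above could be applied roughly like this, assuming an existing `SparkSession` named `spark`:)

```scala
// Fall back to the original (non-vectorized) Parquet reader, but keep
// filter push-down enabled so row groups can still be pruned.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
```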
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@maver1ck IMO, in your case, vectorized parquet reader without filter push
down performs worse than the original one with filter push down. This is pretty
interesting.
---
Github user maver1ck commented on the issue:
https://github.com/apache/spark/pull/13701
I think that this PR also resolves my problem here.
Github user davies commented on the issue:
https://github.com/apache/spark/pull/13701
@viirya In order to have a unit test for this (otherwise it will be broken
again in the future), we could add a counter in the vectorized parquet reader
for row groups for test purposes, then use that to
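(One way such a counter could look — the object and field names here are entirely hypothetical, a sketch of the idea rather than the actual change:)

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical test-only metric: the vectorized Parquet reader would bump
// this each time it begins reading a row group, so a unit test can assert
// that a pushed-down filter actually pruned some groups.
object ParquetRowGroupCounter {
  val rowGroupsRead = new AtomicLong(0)
  def reset(): Unit = rowGroupsRead.set(0)
}

// In a test: write a file with several row groups, reset the counter, run a
// filtered scan, then assert rowGroupsRead is below the total group count.
```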
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@yhuai As 2.0 is released now, do you have time to check this? Thanks.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@yhuai OK. Thanks for letting me know that.
---
Github user yhuai commented on the issue:
https://github.com/apache/spark/pull/13701
@viirya Thank you for updating this. Our schedules are pretty packed for
the release. We can take a look at it once 2.0 is released.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
ping @liancheng @yhuai again. Can you take a look? Thanks.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
ping @liancheng @yhuai Please check if this is ok now. Thanks.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
Agreed. This is the most common performance issue in Big Data. Filter push-down
will make it even worse.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile That depends. In practice, we have many solutions for the case you
mentioned. It does not make sense to keep so many tiny Parquet files.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
: ) Up to you. I think we will see smaller performance gains and a more
significant performance penalty when a table contains very many small Parquet
files. I also think that is expected, anyway.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile As I said, it is not an important issue here. What we want to
confirm is that there is no significant performance penalty.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@viirya Yeah, it is not easy to get a full performance picture. I do not
know how the Spark community did it in the past. When I was working on the
mainframe team, we had dedicated PQAs for measuring
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile I think I have not run the benchmark enough times to confirm
a 5% performance difference. But I think it is not important here
because we don't want to measure the exact
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@viirya If you run it multiple times and still see a 5% performance difference,
you can confirm the penalty is around 5%. However, this might also depend on
other factors, e.g., the total time.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile BTW, if the TPC-DS performance is measured with 2.0 codebase,
this should benefit the performance.
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile You know that the benchmark results will not be the same every
time, even if you run the same code. If the difference is within a small
range, we can assume they have no significant
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
Great! A performance penalty of around 5% looks OK to me. Maybe we can send the
code changes to the performance team to see the TPC-DS improvement. CC @jfchen
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile This is the benchmark results. No significant difference.
Before this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux
3.19.0-25-generic
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@rdblue uh, I see. Thank you for your explanation! My suggestion above is
to confirm what you said in @viirya's test cases. We expect to see the same
results as what you mentioned.
It
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile, we've not seen a penalty from running row-group-level tests
when no row groups are filtered, and we've decided to turn on dictionary
filtering by default. You may see a penalty from
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@viirya Maybe you have not read my discussion with @rdblue . @rdblue
already explained how Parquet internally works. Like what I said above, I think
we still need a test for confirming whether
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@yhuai BTW, when reading more row groups, the performance improvement is
much greater.
Before this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61904/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #61904 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61904/consoleFull)**
for PR 13701 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@yhuai I've run the test to check whether row groups are actually filtered. I
manually printed the number of blocks retrieved in
`SpecificParquetRecordReaderBase`. I set the block size to 1024 and rerun
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
> It sounds like the invalid rows will still be filtered out during column
assembly even if we are unable to prune any row group. If so, counting the
number of rows does not help us determine how many
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #61904 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61904/consoleFull)**
for PR 13701 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
I will update this soon.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61879/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #61879 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61879/consoleFull)**
for PR 13701 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #61879 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61879/consoleFull)**
for PR 13701 at commit
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@rdblue Thank you very much for your answer!
@viirya @yhuai I think the enhancement in this PR does not add the extra
penalty caused by row-level filtering. To confirm it, could
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/13701
Yeah, Parquet doesn't make a distinction for where filters are applied. If
you push a filter, then it will be applied to row groups if possible and
individual rows after that. But if you're
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@rdblue Thank you very much for your reply!
Based on the following code description,
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile, sorry for the delay, I was evidently not getting notifications
until I changed some settings yesterday.
There are a few tests in Parquet that generate files with test data that
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@yhuai Ok. I will give it a try. But it looks like I can only test it manually?
---
Github user yhuai commented on the issue:
https://github.com/apache/spark/pull/13701
Sorry. I am not sure I get it. We can set the row group size to a small
value. Then, it will not be hard to create a parquet file having multiple row
groups.
---
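(A sketch of that setup, assuming an existing `SparkSession` named `spark`; the path is illustrative. `parquet.block.size` bounds the row-group size, so a tiny value forces even a small dataset into many row groups:)

```scala
// Make row groups tiny so even a small dataset spans many of them.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 1024)

spark.range(0, 100000).toDF("id").write.parquet("/tmp/multi_rowgroup_test")

// With filter push-down enabled, this scan should skip most row groups.
spark.read.parquet("/tmp/multi_rowgroup_test").filter("id < 10").count()
```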
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61147/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #61147 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61147/consoleFull)**
for PR 13701 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@yhuai As I mentioned in the description, I am not sure if we can
manipulate row groups as we want, but I have manually tested it to show the
number of rows actually scanned.
---
If your project is
Github user yhuai commented on the issue:
https://github.com/apache/spark/pull/13701
Thank you for the testing. Can you also test the case where a file contains
multiple row groups and we can avoid scanning unneeded ones?
Also, since it is not fixing a critical bug, let's
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #61147 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61147/consoleFull)**
for PR 13701 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
retest this please.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #61143 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61143/consoleFull)**
for PR 13701 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61143/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #61143 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61143/consoleFull)**
for PR 13701 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
ping @liancheng @yhuai again...
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@liancheng @yhuai How about this? Is it ready to merge? Thanks!
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@liancheng Is it possible to merge this before the 2.0 release?
---
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
ping @liancheng @yhuai I've updated the benchmark results. Please see if
you have other thoughts. Thanks!
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
@rdblue Not sure if you can help us? Thank you very much!
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
See [the
comment](https://github.com/apache/spark/pull/13728#issuecomment-226696567) by
@HyukjinKwon: the filter is applied more than once. That means it could be
expensive if the filter does
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile No, I have no idea currently. To make sure the pushed-down
filter is working, I've manually checked the `totalRowCount` as shown in the PR
description.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
Based on my understanding, this performance testing is highly affected by
the number of row groups, the number of pruned groups, and the average number
of rows per group. Do you have any thought
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@gatorsmile yea.
---
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/13701
Just realized my PR https://github.com/apache/spark/pull/13728 is related
to your PR, especially the description of the two configurations
`spark.sql.parquet.filterPushdown` and
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@liancheng I've updated the benchmark. Please take a look. Thanks.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60627/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #60627 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60627/consoleFull)**
for PR 13701 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/13701
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60628/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #60628 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60628/consoleFull)**
for PR 13701 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #60628 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60628/consoleFull)**
for PR 13701 at commit
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/13701
@yhuai I am not sure the row-group info is exposed. But I've manually
output the `totalRowCount` in `SpecificParquetRecordReaderBase` to check the
total number of rows this `RecordReader` will
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/13701
**[Test build #60627 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60627/consoleFull)**
for PR 13701 at commit