[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-27 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 In light of @HyukjinKwon's benchmark it seems like Spark-side filtering is the right thing to do here, so I think this should be good? --- If your project is set up for it, you can reply to th

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14671 Thanks for confirming this. I will work on this.

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-09 Thread davies
Github user davies commented on the issue: https://github.com/apache/spark/pull/14671 @HyukjinKwon That sounds good, thanks!

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14671 @davies Do you mind if I ask whether it is sensible to perform a benchmark and try to submit a PR to disable this (maybe with adding an extra option to enable/disable this but false by default)?

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-06 Thread davies
Github user davies commented on the issue: https://github.com/apache/spark/pull/14671 @andreweduffy Good point, but we still use parquet-mr when there is any complex type in the schema.

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-06 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 @davies Row-level filtering doesn't occur with the vectorized reader, which is now enabled by default.

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-06 Thread davies
Github user davies commented on the issue: https://github.com/apache/spark/pull/14671 Before we disable the record-level filter in the Parquet reader, I think pushing more inefficient predicates into the Parquet reader will be even worse, right?

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-09-06 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 cool, ping to @davies @cloud-fan would either of you be able to look at this?

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14671 @andreweduffy Yup, filed here, https://issues.apache.org/jira/browse/SPARK-17310.

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-30 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 @HyukjinKwon would you like to file a separate ticket for benchmarking? It's pretty orthogonal to this PR, see rdblue's comment above.

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14671 @ash211 I am happy to do so but I would like to make sure if there is an offline benchmark performed already and if we can disable this if the performance is better. I don't want to duplicate som

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-29 Thread ash211
Github user ash211 commented on the issue: https://github.com/apache/spark/pull/14671 @HyukjinKwon do you have time to work on that benchmark over the next week?

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-22 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 cc @davies @cloud-fan for parquet change, seems I got @rdblue's stamp of approval

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-18 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 @rxin Did you get the chance to take a closer look at this?

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/14671 @andreweduffy's comments about this make sense to me. Improving the filters that are pushed is a good idea, even if we decide to disable Parquet's row-by-row filtering. The option to disable
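To illustrate why improving the pushed filters can still pay off without row-by-row filtering, here is a minimal Python sketch of row-group pruning from min/max statistics, followed by engine-side per-row filtering. `RowGroup`, `may_contain` and `scan` are hypothetical names invented for this illustration; this is not the parquet-mr or Spark API, and real statistics handling (nulls, missing stats, non-integer types) is ignored.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical stand-in for a Parquet row group with column statistics."""
    min_value: int
    max_value: int
    values: List[int]


def may_contain(group: RowGroup, wanted: Sequence[int]) -> bool:
    # OR of Eq: keep the group if any wanted value lies within [min, max].
    # This can produce false positives but never drops a matching group.
    return any(group.min_value <= v <= group.max_value for v in wanted)


def scan(groups: Sequence[RowGroup], wanted: Sequence[int]) -> Tuple[int, List[int]]:
    # Step 1: coarse pruning using only statistics (no rows are read).
    kept = [g for g in groups if may_contain(g, wanted)]
    # Step 2: exact per-row filtering, done by the engine on surviving groups.
    rows = [v for g in kept for v in g.values if v in wanted]
    return len(kept), rows


groups = [
    RowGroup(0, 9, [0, 3, 9]),
    RowGroup(10, 19, [10, 15, 19]),
    RowGroup(20, 29, [20, 25]),
]
groups_read, matches = scan(groups, [3, 25])
```

In this sketch the middle row group is skipped entirely, which is the benefit that remains even if the reader's record-level filter is turned off.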

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 Yeah benchmarking is definitely a great idea, as it is likely Spark will be better than Parquet at filtering individual records, but I'm still not quite understanding why this filter is any dif

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14671 @andreweduffy @rxin Maybe I can go for the simple benchmark quickly (maybe within this weekend) and open a PR to disable Parquet row-by-row filtering if it makes sense and this can be the reason

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 That is true, but currently all filters are being pushed down to row-by-row anyway when not using the vectorized reader, so I'm unclear why the IN filter is special

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14671 I mean, maybe we should disable the row-by-row one in Parquet with a proper benchmark first before handling `In` here.
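For readers unfamiliar with the rewrite named in the PR title, the following Python sketch shows the general idea of turning `col IN (v1, ..., vn)` into `col = v1 OR ... OR col = vn`. The `Eq`, `Or` and `In` classes and the `rewrite_in` helper are hypothetical and exist only for this illustration; they are not Spark's `ParquetFilters` code or Parquet's filter API.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Tuple, Union


@dataclass(frozen=True)
class Eq:
    column: str
    value: Any


@dataclass(frozen=True)
class Or:
    left: "Predicate"
    right: "Predicate"


@dataclass(frozen=True)
class In:
    column: str
    values: Tuple[Any, ...]


Predicate = Union[Eq, Or, In]


def rewrite_in(pred: In) -> Predicate:
    """Rewrite `col IN (v1, ..., vn)` as `col = v1 OR ... OR col = vn`."""
    if not pred.values:
        raise ValueError("IN with an empty value list has no OR-of-Eq form")
    equalities = [Eq(pred.column, v) for v in pred.values]
    # Left fold: IN (1, 2, 3) becomes Or(Or(Eq 1, Eq 2), Eq 3).
    return reduce(Or, equalities)


def evaluate(pred: Predicate, row: dict) -> bool:
    """Evaluate a predicate against a single row given as a dict."""
    if isinstance(pred, Eq):
        return row[pred.column] == pred.value
    if isinstance(pred, Or):
        return evaluate(pred.left, row) or evaluate(pred.right, row)
    return row[pred.column] in pred.values


rewritten = rewrite_in(In("a", (1, 2, 3)))
```

The rewritten tree matches exactly the same rows as the original `In`; the concern raised in this thread is about where such a predicate gets evaluated (row group statistics versus every record), not about its meaning.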

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14671 Yea, that is all true. Actually, it would be okay just not to pass the filter [here](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apa

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-17 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 Thanks for the comments guys! Had to search through some code, but I think I understand the current state of things. Correct me if I'm wrong, but it seems that record-by-record filtering only o

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14671 Yea unfortunately the row-by-row filtering doesn't make much sense in Parquet.

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14671 Thanks for cc'ing me! As you might already know, I think it makes sense to allow filtering row groups, but this will also be applied row-by-row for the normal Parquet reader, and this was removed by [S

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14671 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63877/

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14671 Merged build finished. Test PASSed.

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14671 **[Test build #63877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63877/consoleFull)** for PR 14671 at commit [`1c9cf7b`](https://github.com/apache/spark/commit/

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread andreweduffy
Github user andreweduffy commented on the issue: https://github.com/apache/spark/pull/14671 cc @HyukjinKwon @rdblue for Parquet-related change

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14671 **[Test build #63877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63877/consoleFull)** for PR 14671 at commit [`1c9cf7b`](https://github.com/apache/spark/commit/1

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14671 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63867/

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14671 Merged build finished. Test PASSed.

[GitHub] spark issue #14671: [SPARK-17091][SQL] ParquetFilters rewrite IN to OR of Eq

2016-08-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14671 **[Test build #63867 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63867/consoleFull)** for PR 14671 at commit [`7679285`](https://github.com/apache/spark/commit/