[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-03 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 Could you also change the PR title and add SPARK-16321? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #63163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63163/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63163/ Test PASSed. ---

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63161/ Test PASSed. ---

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #63161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63161/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #63163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63163/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #63161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63161/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-02 Thread davies
Github user davies commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile In order to merge this patch sooner, it's better to only have related changes to fix the regression. We can clean the dead code later. --- If your project is set up for it, you can

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-02 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 Found more out-of-date comments: - https://github.com/apache/spark/blob/7742d9f1584150befeb2f3d76cdbd4ea1f37c914/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala#L98

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-02 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @viirya Can you also remove another dead code `overrideMinSplitSize`? BTW, it sounds like these removed functions also set multiple internal and external parameters. Should we still set

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-02 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 uh, I see. This is a regression! We did not call `setFilterPredicate` for the original Parquet reader. --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-02 Thread maver1ck
Github user maver1ck commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile I added comment in Jira. "spark.sql.parquet.filterPushdown has true as a default. Vectorized Reader isn't a case here because I have nested columns (and Vectorized Reader

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-02 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @maver1ck To use the original reader with filter push down, the best way is to set `spark.sql.parquet.enableVectorizedReader` to false and `spark.sql.parquet.filterPushdown` to true. Let us know

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-02 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @maver1ck IMO, in your case, vectorized parquet reader without filter push down performs worse than the original one with filter push down. This is pretty interesting. --- If your project is

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-02 Thread maver1ck
Github user maver1ck commented on the issue: https://github.com/apache/spark/pull/13701 I think that this PR also resolves my problem here.

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-02 Thread davies
Github user davies commented on the issue: https://github.com/apache/spark/pull/13701 @viirya In order to have a unit test for this (otherwise it will be broken again in future), we could add some counter in vectorwized parquet reader for row groups for test purpose, then use that to

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-08-01 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @yhuai As 2.0 is released now, do you have time to check this? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-13 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @yhuai OK. Thanks for letting me know that. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-13 Thread yhuai
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13701 @viirya Thank you for updating this. Our schedules are pretty packed for the release. We can take a look at it once 2.0 is released. --- If your project is set up for it, you can reply to this email

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-12 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 ping @liancheng @yhuai again. Can you take a look? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-10 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 ping @liancheng @yhuai Please check if this is ok now. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-08 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 Agreed. This is the most common performance issue in Big Data. Filter push down will further deteriorate it. --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile That depends. In practice, we have many solutions to deal with the case you mentioned. It is not making sense to keep so many tiny parquet files. --- If your project is set up for it,

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 : ) Up to you. I think we will see less perfomance gains and more significant performance penalty when a table contains many many small parquet files. I also think that is expected, anyway.

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile As I said, it is not an important issue here. What we want to confirm is there is no significant performance penalty.. --- If your project is set up for it, you can reply to this email

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @viirya Yeah, it is not easy to get a full performance picture. I do not know how Spark community did it in the past. When I working for the mainframe team, we had dedicated PQAs for measuring

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile I think the times I run the benchmark is not enough to confirm there is 5% performance difference. But I think it is not important here because we don't want to measure the exact

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @viirya If you run multiple times and still see 5% performance difference, you can confirm the penalty is around 5%. However, this might also depend on the other factors, e.g., the total time.

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile BTW, if the TPC-DS performance is measured with 2.0 codebase, this should benefit the performance. --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile You know that the benchmark results will not the same every time even you run it with the same codes. If the difference is under a small range, we can assume they have no significant

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 Great! Around 5% performance penalty looks OK to me. Maybe we can send the code changes to the performance team for seeing TPC-DS improvement. CC @jfchen --- If your project is set up for it,

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile This is the benchmark results. No significant difference. Before this patch: Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @rdblue uh, I see. Thank you for your explanation! My above suggestion is to confirm what you said in @viirya test cases. We expect to see the same results as what you mentioned. It

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile, we've not seen a penalty from running row group level tests when no row groups are filtered and we've decided to turn on dictionary filtering by default. You may see a penalty from

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @viirya Maybe you have not read my discussion with @rdblue . @rdblue already explained how Parquet internally works. Like what I said above, I think we still need a test for confirming whether

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @yhuai BTW, when reading more row groups, the performance improvement is much more. Before this patch: Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61904/ Test PASSed. ---

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #61904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61904/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @yhuai I've run the test to check if row groups are actually filtered. I manually print the number of blocks retrieved in `SpecificParquetRecordReaderBase`. I set the block size to 1024 and rerun

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 > It sounds like the invalid rows will still be filtered out during column assembly even if we are unable to prune any row group. If so, counting number of rows does not help us determine how many

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #61904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61904/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-06 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 I will update this soon.. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61879/ Test PASSed. ---

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #61879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61879/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-07-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #61879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61879/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-24 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @rdblue Really thank you for your answer! @viirya @yhuai I think the enhancement by this PR does not bring the extra penalty that caused by row-level filtering. To confirm it, could

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/13701 Yeah, Parquet doesn't make a distinction for where filters are applied. If you push a filter, then it will be applied to row groups if possible and individual rows after that. But if you're

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-24 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @rdblue Really thank you for your reply! Based on the following code description,

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile, sorry for the delay, I was evidently not getting notifications until I changed some settings yesterday. There are a few tests in Parquet that generate files with test data that

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-24 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @yhuai Ok. I will have it a try. But looks like I can only manually test it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-24 Thread yhuai
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13701 Sorry. I am not sure I get it. We can set the row group size to a small size. Then, it will not be hard to create a parquet file having multiple row groups. --- If your project is set up for it,

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61147/ Test PASSed. ---

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-24 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #61147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61147/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-23 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @yhuai As I mentioned in the description, I am not sure if we can manipulate row groups as we want, but I have manually tested it to show the actually scanned row numbers. --- If your project is

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-23 Thread yhuai
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/13701 Thank you for the testing. Can you also test the case that a file contains multiple row groups and we can avoid of scanning unneeded ones? Also since it is not fixing a critical bug, let's

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-23 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #61147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61147/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-23 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so,

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-23 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #61143 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61143/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61143/ Test FAILed. ---

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-23 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #61143 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61143/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-23 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 ping @liancheng @yhuai again... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-20 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @liancheng @yhuai How about this? Is it ready to merge? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-20 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @liancheng Is it possible to merge into before 2.0 release? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-20 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 ping @liancheng @yhuai I've updated the benchmark results. Please see if you have other thoughts. Thanks! --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-17 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 @rdblue Not sure if you can help us? Thank you very much! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-17 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 See [the comment](https://github.com/apache/spark/pull/13728#issuecomment-226696567) by @HyukjinKwon the filter is applied more than once. That means, it could be expensive if the filter does

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-17 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile No, I have no idea currently. To make sure the pushed down filter is working, I've manually check the `totalRowCount` as shown in PR description. --- If your project is set up for it,

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-17 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 Based on my understanding, this performance testing is highly affected by the number of row groups, the number of pruned groups and the number of average rows per group. Do you have any thought

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-17 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @gatorsmile yea. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-17 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/13701 Just realized my PR https://github.com/apache/spark/pull/13728 is related to your PR, especially the description of the two configuration `spark.sql.parquet.filterPushdown` and

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-16 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @liancheng I've updated the benchmark. Please take a look. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60627/ Test PASSed. ---

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #60627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60627/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60628/ Test PASSed. ---

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-16 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #60628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60628/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #60628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60628/consoleFull)** for PR 13701 at commit

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-15 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/13701 @yhuai I am not sure the row-group info is exposed. But I've manually output the `totalRowCount` in `SpecificParquetRecordReaderBase` to check the total number of rows this `RecordReader` will

[GitHub] spark issue #13701: [SPARK-15639][SQL] Try to push down filter at RowGroups ...

2016-06-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13701 **[Test build #60627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60627/consoleFull)** for PR 13701 at commit