Re: parquet predicate / projection pushdown into unionAll

2014-10-01 Thread DB Tsai
Hi Cody and Michael, We ran into the same issue. Each day of data we have is stored into one parquet, and we want to query it against monthly parquet data. The data for each data is around 600GB, and we use 300 executors with 8GB memory for each executor. Without the patch, it took forever, and cr

Re: parquet predicate / projection pushdown into unionAll

2014-09-12 Thread Michael Armbrust
Yeah, thanks for implementing it! Since Spark SQL is an alpha component and moving quickly the plan is to backport all of master into the next point release in the 1.1 series. On Fri, Sep 12, 2014 at 9:27 AM, Cody Koeninger wrote: > Cool, thanks for your help on this. Any chance of adding it t

Re: parquet predicate / projection pushdown into unionAll

2014-09-12 Thread Cody Koeninger
Cool, thanks for your help on this. Any chance of adding it to the 1.1.1 point release, assuming there ends up being one? On Wed, Sep 10, 2014 at 11:39 AM, Michael Armbrust wrote: > Hey Cody, > > Thanks for doing this! Will look at your PR later today. > > Michael > > On Wed, Sep 10, 2014 at 9

Re: parquet predicate / projection pushdown into unionAll

2014-09-10 Thread Michael Armbrust
Hey Cody, Thanks for doing this! Will look at your PR later today. Michael On Wed, Sep 10, 2014 at 9:31 AM, Cody Koeninger wrote: > Tested the patch against a cluster with some real data. Initial results > seem like going from one table to a union of 2 tables is now closer to a > doubling of

Re: parquet predicate / projection pushdown into unionAll

2014-09-10 Thread Cody Koeninger
Tested the patch against a cluster with some real data. Initial results seem like going from one table to a union of 2 tables is now closer to a doubling of query time as expected, instead of 5 to 10x. Let me know if you see any issues with that PR. On Wed, Sep 10, 2014 at 8:19 AM, Cody Koeninge

Re: parquet predicate / projection pushdown into unionAll

2014-09-10 Thread Cody Koeninger
So the obvious thing I was missing is that the analyzer has already resolved attributes by the time the optimizer runs, so the references in the filter / projection need to be fixed up to match the children. Created a PR, let me know if there's a better way to do it. I'll see about testing perfor

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
Ok, so looking at the optimizer code for the first time and trying the simplest rule that could possibly work, object UnionPushdown extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { // Push down filter into union case f @ Filter(condition, u @ Union

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
What Patrick said is correct. Two other points: - In the 1.2 release we are hoping to beef up the support for working with partitioned parquet independent of the metastore. - You can actually do operations like INSERT INTO for parquet tables to add data. This creates new parquet files for each

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Patrick Wendell
I think what Michael means is people often use this to read existing partitioned Parquet tables that are defined in a Hive metastore rather than data generated directly from within Spark and then reading it back as a table. I'd expect the latter case to become more common, but for now most users co

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
Maybe I'm missing something, I thought parquet was generally a write-once format and the sqlContext interface to it seems that way as well. d1.saveAsParquetFile("/foo/d1") // another day, another table, with same schema d2.saveAsParquetFile("/foo/d2") Will give a directory structure like /foo/d

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
I think usually people add these directories as multiple partitions of the same table instead of union. This actually allows us to efficiently prune directories when reading in addition to standard column pruning. On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf wrote: > I'm kind of surprised this

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Gary Malouf
I'm kind of surprised this was not run into before. Do people not segregate their data by day/week in the HDFS directory structure? On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust wrote: > Thanks! > > On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger > wrote: > > > Opened > > > > https://issue

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger wrote: > Opened > > https://issues.apache.org/jira/browse/SPARK-3462 > > I'll take a look at ColumnPruning and see what I can do > > On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust > wrote: > >> On Tue, Sep 9, 2014 at 10:17 AM, Cody Koen

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
Opened https://issues.apache.org/jira/browse/SPARK-3462 I'll take a look at ColumnPruning and see what I can do On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust wrote: > On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger > wrote: >> >> Is there a reason in general not to push projections and pr

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger wrote: > > Is there a reason in general not to push projections and predicates down > into the individual ParquetTableScans in a union? > This would be a great case to add to ColumnPruning. Would be awesome if you could open a JIRA or even a PR :)

parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
I've been looking at performance differences between spark sql queries against single parquet tables, vs a unionAll of two tables. It's a significant difference, like 5 to 10x Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a union