Hi Cody and Michael,
We ran into the same issue. Each day of data we have is stored in one
parquet file, and we want to query it against the monthly parquet data.
The data for each day is around 600GB, and we use 300 executors with 8GB
of memory each. Without the patch, it took forever, and …
Yeah, thanks for implementing it!
Since Spark SQL is an alpha component and moving quickly, the plan is to
backport all of master into the next point release in the 1.1 series.
On Fri, Sep 12, 2014 at 9:27 AM, Cody Koeninger wrote:
> Cool, thanks for your help on this. Any chance of adding it to the 1.1.1
> point release, assuming there ends up being one?
Cool, thanks for your help on this. Any chance of adding it to the 1.1.1
point release, assuming there ends up being one?
On Wed, Sep 10, 2014 at 11:39 AM, Michael Armbrust wrote:
> Hey Cody,
>
> Thanks for doing this! Will look at your PR later today.
>
> Michael
Hey Cody,
Thanks for doing this! Will look at your PR later today.
Michael
On Wed, Sep 10, 2014 at 9:31 AM, Cody Koeninger wrote:
> Tested the patch against a cluster with some real data. Initial results
> seem like going from one table to a union of 2 tables is now closer to a
> doubling of query time as expected, instead of 5 to 10x.
Tested the patch against a cluster with some real data. Initial results
seem like going from one table to a union of 2 tables is now closer to a
doubling of query time as expected, instead of 5 to 10x.
Let me know if you see any issues with that PR.
On Wed, Sep 10, 2014 at 8:19 AM, Cody Koeninger wrote:
So the obvious thing I was missing is that the analyzer has already
resolved attributes by the time the optimizer runs, so the references in
the filter / projection need to be fixed up to match the children.
Created a PR; let me know if there's a better way to do it. I'll see about
testing performance …
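[For readers following along, here is a self-contained sketch of a rule with that reference fix-up. It mirrors the UnionPushdown rule that eventually shipped in Catalyst's optimizer, but the exact names and signatures are reconstructed here, not quoted from the PR.]

import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeMap, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project, Union}
import org.apache.spark.sql.catalyst.rules.Rule

object UnionPushdown extends Rule[LogicalPlan] {

  // Map each output attribute of the left child to the attribute at the
  // same position in the right child, since the union's own output is
  // resolved against the left side.
  def buildRewrites(union: Union): AttributeMap[Attribute] = {
    assert(union.left.output.size == union.right.output.size)
    AttributeMap(union.left.output.zip(union.right.output))
  }

  // Rewrite an expression so its references resolve against the right
  // child's output instead of the left's.
  def pushToRight[A <: Expression](e: A, rewrites: AttributeMap[Attribute]): A =
    e.transform {
      case a: Attribute => rewrites(a)
    }.asInstanceOf[A]

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Push a filter beneath the union, fixing up right-side references.
    case Filter(condition, u @ Union(left, right)) =>
      val rewrites = buildRewrites(u)
      Union(
        Filter(condition, left),
        Filter(pushToRight(condition, rewrites), right))

    // Push a projection beneath the union, with the same fix-up.
    case Project(projectList, u @ Union(left, right)) =>
      val rewrites = buildRewrites(u)
      Union(
        Project(projectList, left),
        Project(projectList.map(pushToRight(_, rewrites)), right))
  }
}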
Ok, so looking at the optimizer code for the first time and trying the
simplest rule that could possibly work:

object UnionPushdown extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Push down filter into union
    // (the original snippet was truncated here; a minimal completion that
    // just copies the filter onto both children, i.e. the naive version
    // that the attribute fix-up described above corrects)
    case f @ Filter(condition, u @ Union(left, right)) =>
      u.copy(left = f.copy(child = left), right = f.copy(child = right))
  }
}
What Patrick said is correct. Two other points:
- In the 1.2 release we are hoping to beef up the support for working with
partitioned parquet independent of the metastore.
- You can actually do operations like INSERT INTO for parquet tables to
add data. This creates new parquet files for each insertion.
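[To make that concrete, a minimal sketch with the 1.1-era SchemaRDD API; the paths, table name, and the assumption that /foo/d2 holds same-schema data are all made up for illustration.]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-append"))
val sqlContext = new SQLContext(sc)

// Register an existing parquet directory as a queryable table.
sqlContext.parquetFile("/foo/d1").registerTempTable("events")

// New rows with the same schema (hypothetical source).
val newEvents = sqlContext.parquetFile("/foo/d2")

// INSERT INTO semantics: appends by writing additional parquet files
// under the table's directory rather than rewriting the existing ones.
newEvents.insertInto("events")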
I think what Michael means is that people often use this to read existing
partitioned Parquet tables that are defined in a Hive metastore, rather
than data generated directly from within Spark and then read back as a
table. I'd expect the latter case to become more common, but for now most
users …
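[As an illustration of that metastore-backed setup; the table, columns, and locations here are hypothetical, and it assumes an existing SparkContext `sc`.]

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// One table, one partition per day, backed by parquet. (On older Hive
// versions the parquet SerDe and input/output formats had to be spelled
// out instead of STORED AS PARQUET.)
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (col_a STRING, col_b INT)
  PARTITIONED BY (day STRING)
  STORED AS PARQUET
  LOCATION '/foo'
""")

// Each day's directory becomes a partition of the same table,
// rather than a separate table that has to be unioned in.
hiveContext.sql("ALTER TABLE events ADD PARTITION (day = '2014-09-09') LOCATION '/foo/d1'")
hiveContext.sql("ALTER TABLE events ADD PARTITION (day = '2014-09-10') LOCATION '/foo/d2'")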
Maybe I'm missing something; I thought parquet was generally a write-once
format, and the sqlContext interface to it seems that way as well.
d1.saveAsParquetFile("/foo/d1")
// another day, another table, with same schema
d2.saveAsParquetFile("/foo/d2")
Will give a directory structure like
/foo/d1
/foo/d2
I think usually people add these directories as multiple partitions of the
same table instead of unioning them. This actually allows us to efficiently
prune directories when reading, in addition to standard column pruning.
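[With the hypothetical partitioned table sketched earlier, that pruning means a query constraining the partition column only touches the matching directory.]

// Only the day='2014-09-09' partition (/foo/d1) gets scanned; the other
// partition directories are pruned before any parquet data is read.
hiveContext.sql("SELECT col_a FROM events WHERE day = '2014-09-09' AND col_b > 0")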
On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf wrote:
> I'm kind of surprised this was not run into before. Do people not
> segregate their data by day/week in the HDFS directory structure?
I'm kind of surprised this was not run into before. Do people not
segregate their data by day/week in the HDFS directory structure?
On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust wrote:
> Thanks!
>
> On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger wrote:
>
> > Opened
> >
> > https://issues.apache.org/jira/browse/SPARK-3462
Thanks!
On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger wrote:
> Opened
>
> https://issues.apache.org/jira/browse/SPARK-3462
>
> I'll take a look at ColumnPruning and see what I can do
>
> On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust wrote:
>
>> On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger wrote:
Opened
https://issues.apache.org/jira/browse/SPARK-3462
I'll take a look at ColumnPruning and see what I can do
On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust wrote:
> On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger wrote:
>>
>> Is there a reason in general not to push projections and predicates down
>> into the individual ParquetTableScans in a union?
On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger wrote:
>
> Is there a reason in general not to push projections and predicates down
> into the individual ParquetTableScans in a union?
>
This would be a great case to add to ColumnPruning. Would be awesome if
you could open a JIRA or even a PR :)
I've been looking at performance differences between Spark SQL queries
against a single parquet table vs. a unionAll of two tables. It's a
significant difference, like 5 to 10x.
Is there a reason in general not to push projections and predicates down
into the individual ParquetTableScans in a union?
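[For context, the kind of comparison being described looks roughly like this; paths and column names are hypothetical, and an existing SparkContext `sc` is assumed.]

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val d1 = sqlContext.parquetFile("/foo/d1")
val d2 = sqlContext.parquetFile("/foo/d2")

// Single table: the projection and predicate reach the ParquetTableScan,
// so only the needed columns and row groups are read.
d1.registerTempTable("single")
sqlContext.sql("SELECT col_a FROM single WHERE col_b = 1").collect()

// Union of two tables: without pushdown through Union, each
// ParquetTableScan reads every column and filters afterwards,
// hence the reported 5 to 10x gap.
d1.unionAll(d2).registerTempTable("both")
sqlContext.sql("SELECT col_a FROM both WHERE col_b = 1").collect()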