Aman,

That is exactly the clarification that I needed. I had a hazy memory of a
problem in this area, but not enough to actually figure out the current
state.

In case anybody cares, being able to do this is really handy. The basic
idea is to keep long history in files and recent history in a DB. That
lets you create files whose data is sorted advantageously so that you get
excellent compression. You can get a nearly atomic switch-over to newly
created files, with lazy deletion of database entries, by keeping the
cutoff date in a database row. The file side would only look for data
before the cutoff and the DB would only look for data after the cutoff. By
putting the new files in place (created by CTAS on an about-to-be-obsolete
part of the DB) before changing the cutoff date, we get apparent atomicity.

After the switch, and after a reasonable delay beyond that (to let all
pending queries finish), the DB can be trimmed.
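
To make that concrete, here is a rough sketch of what such a view could
look like. All table and column names below are invented for illustration:
dfs.archive.prefs would be the Parquet files written by CTAS, db.live.prefs
the operational table, and db.control.cutoff a one-row table holding the
cutoff date.

  -- illustrative sketch only; names are hypothetical
  CREATE VIEW hybrid_prefs AS
  SELECT user_id, setting_name, setting_value, updated_at
  FROM dfs.archive.prefs
  WHERE updated_at < (SELECT cutoff_date FROM db.control.cutoff)
  UNION ALL
  SELECT user_id, setting_name, setting_value, updated_at
  FROM db.live.prefs
  WHERE updated_at >= (SELECT cutoff_date FROM db.control.cutoff);

Every row is visible exactly once through the view, and moving the cutoff
forward (once the new Parquet files are in place) shifts a slice of history
from the DB side to the file side with a single-row update. Whether the
scalar subquery for the cutoff is planned efficiently is a separate
question; one could also bake a literal cutoff into the view each time it
is recreated.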

Without a working pushdown through unions, this is all kind of pointless.
If that is working now, it would be fabulous.
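
The check I would run, roughly, is to explain a filtered query against
such a view (using the made-up view from the sketch above) and see where
the filter lands:

  -- hypothetical query against the sketch view above
  EXPLAIN PLAN FOR
  SELECT user_id, setting_name, setting_value
  FROM hybrid_prefs
  WHERE updated_at >= DATE '2018-03-15';

If the pushdown works, the date predicate should show up on both inputs of
the UnionAll (enabling partition pruning on the Parquet side and index use
on the DB side) rather than as a single filter sitting above the union.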

As an example of how big a win this can be, consider a use case where we
want to keep all old states of customer preferences and context (say, for
a mobile phone). Almost all of the hundreds of settings for an individual
would be unchanged even if a few do change. That means that if you can
arrange a day (or more) of data by user id, the columnar compression of
Parquet will crush the data size. This only works, however, if you can
collect a fair number of rows for each user. Hence the idea of a hybrid
setup.
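
For what it's worth, the CTAS I have in mind would be roughly the
following (again, names and dates are invented for the example); the ORDER
BY is what makes the compression work:

  -- hypothetical; writes one day of soon-to-be-trimmed DB data as Parquet
  ALTER SESSION SET `store.format` = 'parquet';

  CREATE TABLE dfs.archive.prefs_20180319 AS
  SELECT user_id, setting_name, setting_value, updated_at
  FROM db.live.prefs
  WHERE updated_at >= DATE '2018-03-19'
    AND updated_at <  DATE '2018-03-20'
  ORDER BY user_id, updated_at;

Sorting by user_id puts all of a user's mostly-identical settings next to
each other in the file, which is exactly what Parquet's dictionary and
run-length encodings need to crush the size.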



On Mon, Mar 19, 2018 at 11:57 PM, Aman Sinha <[email protected]> wrote:

> Due to an infinite loop occurring in Calcite planning, we had to disable
> the filter pushdown past the union (SetOps).  See
> https://issues.apache.org/jira/browse/DRILL-3855.
> Now that we have rebased on Calcite 1.15.0, we should re-enable this and
> test; if the pushdown works, then the partition pruning on both sides of
> the union should automatically work after that.
>
> Will follow up on this.
>
> -Aman
>
> On Mon, Mar 19, 2018 at 3:02 PM, Kunal Khatua <[email protected]>
> wrote:
>
> > I think Ted's question is two-fold, with the former being more important.
> > 1. Can we push filters past a union.
> > 2. Will Drill push filters down to the source.
> >
> > For the latter, it depends on the source.
> > For the former, it depends primarily on whether Calcite supports this. I
> > haven't tried it, so I can't say.
> >
> > On 3/19/2018 2:22:54 PM, rahul challapalli <[email protected]>
> > wrote:
> > First I would suggest ignoring the view and trying out a query which has
> > the required filters as part of the subqueries on both sides of the union
> > (for both the database and the partitioned parquet data). The plan for
> > such a query should have the answers to your question. If both subqueries
> > independently prune out unnecessary data, using partitions or indexes, I
> > don't think adding a union between them would alter that behavior.
> >
> > -Rahul
> >
> > On Mon, Mar 19, 2018 at 1:44 PM, Ted Dunning wrote:
> >
> > > If I create a view that is a union of partitioned parquet files and a
> > > database that has secondary indexes, will Drill be able to properly push
> > > down query limits into both parts of the union?
> > >
> > > In particular, if I have lots of archival data and parquet partitioned
> > > by time but my query only asks for recent data that is in the database,
> > > will the query avoid the parquet files entirely (as you would wish)?
> > >
> > > Conversely, if the data I am asking for is entirely in the archive, will
> > > the query make use of the partitioning on my parquet files correctly?
> > >
> >
>
