Yep, sure. So the directory structure looks something like:
root/2017/03/17/00/00/{300 - 400 CSV files}
root/2017/03/17/00/30/{300 - 400 CSV files}
root/2017/03/17/01/00/{300 - 400 CSV files}
...
And there's perhaps a year's worth of files, only increasing over time. Our
queries tend to run over no more than a day's worth of data, usually 2 - 6
hours. Right now these files are imported into Redshift, where we run our
queries, but what I'm working on is assessing the feasibility of querying
the CSV files directly. Our queries run every hour or a bit more frequently
than that, so if we're looking back over a span of multiple hours we need
to be able to query across day/month boundaries.
The files contain time series data, so each row corresponds to an event
with multiple dimensions and metrics. Our queries tend to be rollups where
we group by some subset of dimensions and aggregate up the metrics. So a
basic example might be, every hour I want the top ten counts grouped by
some dimension. The basic structure of the query would be:
SELECT d, count(1) AS c FROM ... GROUP BY d ORDER BY c DESC LIMIT 10;
The question I'm asking is: what goes into the ..., or do I use the WHERE
clauses Charles Givre suggested in a different answer?
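For concreteness, here's a rough sketch of how the dirN pseudo-columns might slot into that query, including a range that crosses a day boundary. The `s3.`root`` table path and the dimension column `d` are placeholders for illustration, not the actual plugin names:

```sql
-- Top ten counts over the six hours from 2017-03-16 22:00 to
-- 2017-03-17 04:00 UTC. dir0..dir4 map to the YYYY/MM/DD/HH/MM
-- directory levels under the queried root; constant filters on them
-- should let Drill prune partitions rather than scan every file.
SELECT d, count(1) AS c
FROM s3.`root`
WHERE (dir0 = '2017' AND dir1 = '03' AND dir2 = '16' AND dir3 >= '22')
   OR (dir0 = '2017' AND dir1 = '03' AND dir2 = '17' AND dir3 < '04')
GROUP BY d
ORDER BY c DESC
LIMIT 10;
```

The OR of two per-day predicates is what makes the day-boundary case work without a UNION ALL.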
For performance reasons, we probably want to convert these files to Parquet,
so I'm also looking at whether the table metadata feature would make this
easier. In that case, I'd put the time of the event in the rows and query
against the root directory with WHERE clauses that restrict to my query's
time range. It seems, though, that refreshing the metadata table may be
prohibitively expensive: in my test, refreshing just a day's worth of data
takes about 5 minutes. I don't yet understand quite how the refresh works
(the documentation isn't very precise on this), so I'm still figuring it
out.
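For reference, the rough shape of the conversion-plus-refresh path I've been testing looks like the following. The `dfs.tmp` workspace and the `events` path are placeholders, not the real layout:

```sql
-- Write CTAS output as Parquet for this session.
ALTER SESSION SET `store.format` = 'parquet';

-- Re-materialize one directory of CSVs as Parquet via CTAS.
CREATE TABLE dfs.tmp.`events/2017/03/17/00/00` AS
SELECT * FROM s3.`root/2017/03/17/00/00`;

-- Build (or rebuild) the Parquet metadata cache. This walks the whole
-- directory tree and writes .drill.parquet_metadata files per directory,
-- which would explain why a day's worth of small files takes minutes.
REFRESH TABLE METADATA dfs.tmp.`events`;
```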
Wes
On Wed, May 10, 2017 at 1:56 AM, Abhishek Girish <[email protected]> wrote:
> Can you share more details of how the data is structured within the S3
> bucket, using some examples? Also some representative queries showing what
> you are currently doing and what you hope would be possible instead? I'm
> not clear on what your question is.
>
> The Drill special attributes - filename, dir0, dir1, ... - do work for
> data within the S3 storage plugin.
>
> On Tue, May 9, 2017 at 7:06 AM, Wesley Chow <[email protected]> wrote:
>
> > What is the recommended way to issue a query against a large number of
> > tables in S3? At the moment I'm aliasing the table as a giant UNION ALL,
> > but is there a better way to do this?
> >
> > Our data is stored as a time hierarchy, like YYYY/MM/DD/HH/MM in UTC,
> > but unfortunately I can't simply run the query recursively on an entire
> > day of data: I usually need a day of data in a non-UTC time zone. Is
> > there some elegant way to grab that data using the dir0, dir1 magic
> > columns?
> >
> > Thanks,
> > Wes
> >
>