Suppose that I have a directory structure in S3 like so:

root/YYYY/MM/{lots of files}

Where YYYY and MM are year and month numbers. If I run a query like:

SELECT count(1) FROM root WHERE dir0='2017' AND dir1='03';

Does Drill do a scan to find all files in root, thus picking up files from
2016, and then filter them down to ones matching dir0='2017' and dir1='03'
before reading the data? That's what I meant by "scan all the files." Or
does Drill know that it only has to do a scan of files in the 2017/03
directory?

Wes


On Wed, May 10, 2017 at 12:15 PM, Chunhui Shi <[email protected]> wrote:

> I think what Charles meant was "WHERE (dir2 = 15 AND dir3 < 20) OR (dir2 =
> 14 AND dir3 > 4)",  and of course you need to add dir0 and dir1 for year
> and month.
>
>
> And what do you mean by "scan all the files on every query", scan all the
> files of one day data, I thought this was your purpose?
>
> ________________________________
> From: Wesley Chow <[email protected]>
> Sent: Wednesday, May 10, 2017 9:04:12 AM
> To: [email protected]
> Subject: Re: querying from multiple directories in S3
>
> I don't think so, because doesn't AND commute, which would mean dir2 = 15
> AND dir2=14 would always be false?
>
> Even if there is some comparison that works, isn't there still an issue
> that the S3 file source has to scan all the files on every query?
>
> Wes
>
> On Wed, May 10, 2017 at 8:15 AM, Charles Givre <[email protected]> wrote:
>
> > Hi Wes,
> > Are you putting the dirX fields in the WHERE clause?
> > IE  Couldn't you do soemthing like:
> >
> > SELECT  <fields>
> > FROM s3.data
> > WHERE (dir2 = 15 AND dir3 < 20) AND (dir2 = 14 AND dir3 > 4)
> >
> > In theory this could work for UTC -4.  It’s ugly… but I think it would
> > work.
> > — C
> >
> >
> >
> > > On May 9, 2017, at 10:06, Wesley Chow <[email protected]> wrote:
> > >
> > > What is the recommended way to issue a query against a large number of
> > > tables in S3? At the moment I'm aliasing the table as a giant UNION
> ALL,
> > > but is there a better way to do this?
> > >
> > > Our data is stored as a time hierarchy, like YYYY/MM/DD/HH/MM in UTC,
> but
> > > unfortunately I can't simply run the query recursively on an entire day
> > of
> > > data. I usually need a day of data in a non-UTC time zone. Is there
> some
> > > elegant way to grab that data using the dir0, dir1 magic columns?
> > >
> > > Thanks,
> > > Wes
> >
> >
>

Reply via email to