Drill will only scan the files in the 2017/03 directory. See https://drill.apache.org/docs/how-to-partition-data/, which describes an example very similar to your use case.
-- Zelaine On 5/10/17, 10:32 AM, "Wesley Chow" <[email protected]> wrote: Suppose that I have a directory structure in S3 like so: root/YYYY/MM/{lots of files} Where YYYY and MM are year and month numbers. If I run a query like: SELECT count(1) FROM root WHERE dir0='2017' AND dir1='03'; Does Drill do a scan to find all files in root, thus picking up files from 2016, and then filter them down to ones matching dir0='2017' and dir1='03' before reading the data? That's what I meant by "scan all the files." Or does Drill know that it only has to do a scan of files in the 2017/03 directory? Wes On Wed, May 10, 2017 at 12:15 PM, Chunhui Shi <[email protected]> wrote: > I think what Charles meant was "WHERE (dir2 = 15 AND dir3 < 20) OR (dir2 = > 14 AND dir3 > 4)", and of course you need to add dir0 and dir1 for year > and month. > > > And what do you mean by "scan all the files on every query", scan all the > files of one day data, I thought this was your purpose? > > ________________________________ > From: Wesley Chow <[email protected]> > Sent: Wednesday, May 10, 2017 9:04:12 AM > To: [email protected] > Subject: Re: querying from multiple directories in S3 > > I don't think so, because doesn't AND commute, which would mean dir2 = 15 > AND dir2=14 would always be false? > > Even if there is some comparison that works, isn't there still an issue > that the S3 file source has to scan all the files on every query? > > Wes > > On Wed, May 10, 2017 at 8:15 AM, Charles Givre <[email protected]> wrote: > > > Hi Wes, > > Are you putting the dirX fields in the WHERE clause? > > IE Couldn't you do soemthing like: > > > > SELECT <fields> > > FROM s3.data > > WHERE (dir2 = 15 AND dir3 < 20) AND (dir2 = 14 AND dir3 > 4) > > > > In theory this could work for UTC -4. It’s ugly… but I think it would > > work. > > — C > > > > > > > > > On May 9, 2017, at 10:06, Wesley Chow <[email protected]> wrote: > > > > > > What is the recommended way to issue a query against a large number of > > > tables in S3? At the moment I'm aliasing the table as a giant UNION > ALL, > > > but is there a better way to do this? > > > > > > Our data is stored as a time hierarchy, like YYYY/MM/DD/HH/MM in UTC, > but > > > unfortunately I can't simply run the query recursively on an entire day > > of > > > data. I usually need a day of data in a non-UTC time zone. Is there > some > > > elegant way to grab that data using the dir0, dir1 magic columns? > > > > > > Thanks, > > > Wes > > > > >
