That is neat, thanks! Wes
On Wed, May 10, 2017 at 1:47 PM, Zelaine Fong <[email protected]> wrote: > Drill will only scan the files in the 2017/03 directory. See > https://drill.apache.org/docs/how-to-partition-data/, which describes an > example very similar to your use case. > > -- Zelaine > > On 5/10/17, 10:32 AM, "Wesley Chow" <[email protected]> wrote: > > Suppose that I have a directory structure in S3 like so: > > root/YYYY/MM/{lots of files} > > Where YYYY and MM are year and month numbers. If I run a query like: > > SELECT count(1) FROM root WHERE dir0='2017' AND dir1='03'; > > Does Drill do a scan to find all files in root, thus picking up files > from > 2016, and then filter them down to ones matching dir0='2017' and > dir1='03' > before reading the data? That's what I meant by "scan all the files." > Or > does Drill know that it only has to do a scan of files in the 2017/03 > directory? > > Wes > > > On Wed, May 10, 2017 at 12:15 PM, Chunhui Shi <[email protected]> wrote: > > > I think what Charles meant was "WHERE (dir2 = 15 AND dir3 < 20) OR > (dir2 = > > 14 AND dir3 > 4)", and of course you need to add dir0 and dir1 for > year > > and month. > > > > > > And what do you mean by "scan all the files on every query", scan > all the > > files of one day data, I thought this was your purpose? > > > > ________________________________ > > From: Wesley Chow <[email protected]> > > Sent: Wednesday, May 10, 2017 9:04:12 AM > > To: [email protected] > > Subject: Re: querying from multiple directories in S3 > > > > I don't think so, because doesn't AND commute, which would mean dir2 > = 15 > > AND dir2=14 would always be false? > > > > Even if there is some comparison that works, isn't there still an > issue > > that the S3 file source has to scan all the files on every query? > > > > Wes > > > > On Wed, May 10, 2017 at 8:15 AM, Charles Givre <[email protected]> > wrote: > > > > > Hi Wes, > > > Are you putting the dirX fields in the WHERE clause? > > > IE Couldn't you do soemthing like: > > > > > > SELECT <fields> > > > FROM s3.data > > > WHERE (dir2 = 15 AND dir3 < 20) AND (dir2 = 14 AND dir3 > 4) > > > > > > In theory this could work for UTC -4. It’s ugly… but I think it > would > > > work. > > > — C > > > > > > > > > > > > > On May 9, 2017, at 10:06, Wesley Chow <[email protected]> wrote: > > > > > > > > What is the recommended way to issue a query against a large > number of > > > > tables in S3? At the moment I'm aliasing the table as a giant > UNION > > ALL, > > > > but is there a better way to do this? > > > > > > > > Our data is stored as a time hierarchy, like YYYY/MM/DD/HH/MM in > UTC, > > but > > > > unfortunately I can't simply run the query recursively on an > entire day > > > of > > > > data. I usually need a day of data in a non-UTC time zone. Is > there > > some > > > > elegant way to grab that data using the dir0, dir1 magic columns? > > > > > > > > Thanks, > > > > Wes > > > > > > > > > > >
