Re: querying from multiple directories in S3

Zelaine Fong Wed, 10 May 2017 10:47:44 -0700

Drill will only scan the files in the 2017/03 directory.  See 
https://drill.apache.org/docs/how-to-partition-data/, which describes an 
example very similar to your use case.
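
For the earlier question in this thread about pulling one day of data in a
non-UTC time zone, the dirN filter can be generated rather than written by
hand. The following is a minimal sketch in Python, not something taken from
the Drill docs: the function names utc_hour_dirs and where_clause are made up
for illustration, and it assumes a fixed UTC offset (no DST handling) and
zero-padded directory names such as 2017/05/14/04. It only builds the
predicate text; running EXPLAIN PLAN FOR on the resulting query would be the
way to check that Drill actually prunes on it.

```python
from datetime import date, datetime, timedelta, timezone


def utc_hour_dirs(local_day, utc_offset_hours):
    """Return (dir0, dir1, dir2, dir3) strings for the 24 UTC hours
    that cover one local calendar day at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    # Local midnight, converted to UTC
    start_local = datetime(local_day.year, local_day.month, local_day.day, tzinfo=tz)
    start_utc = start_local.astimezone(timezone.utc)
    hours = [start_utc + timedelta(hours=h) for h in range(24)]
    # Assumption: directories are zero-padded (YYYY/MM/DD/HH)
    return [(f"{t:%Y}", f"{t:%m}", f"{t:%d}", f"{t:%H}") for t in hours]


def where_clause(dirs):
    """Build a WHERE predicate over dir0..dir3, one OR branch per UTC day."""
    by_day = {}
    for y, m, d, h in dirs:
        by_day.setdefault((y, m, d), []).append(h)
    parts = []
    for (y, m, d), hrs in by_day.items():
        hour_list = ", ".join(f"'{h}'" for h in hrs)
        parts.append(
            f"(dir0='{y}' AND dir1='{m}' AND dir2='{d}' AND dir3 IN ({hour_list}))"
        )
    return " OR ".join(parts)


if __name__ == "__main__":
    # One local day at UTC-4, as in the example discussed in this thread
    print(where_clause(utc_hour_dirs(date(2017, 5, 14), -4)))
```

For 2017-05-14 at UTC-4 this selects hours 04 through 23 under 2017/05/14 and
hours 00 through 03 under 2017/05/15.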


-- Zelaine

On 5/10/17, 10:32 AM, "Wesley Chow" <[email protected]> wrote:

    Suppose that I have a directory structure in S3 like so:
    
    root/YYYY/MM/{lots of files}
    
    Where YYYY and MM are year and month numbers. If I run a query like:
    
    SELECT count(1) FROM root WHERE dir0='2017' AND dir1='03';
    
    Does Drill do a scan to find all files in root, thus picking up files from
    2016, and then filter them down to ones matching dir0='2017' and dir1='03'
    before reading the data? That's what I meant by "scan all the files." Or
    does Drill know that it only has to do a scan of files in the 2017/03
    directory?
    
    Wes
    
    
    On Wed, May 10, 2017 at 12:15 PM, Chunhui Shi <[email protected]> wrote:
    
    > I think what Charles meant was "WHERE (dir2 = 15 AND dir3 < 20) OR (dir2 =
    > 14 AND dir3 > 4)",  and of course you need to add dir0 and dir1 for year
    > and month.
    >
    >
    > And what do you mean by "scan all the files on every query"? Scanning all
    > the files for one day of data is what you wanted, I thought.
    >
    > ________________________________
    > From: Wesley Chow <[email protected]>
    > Sent: Wednesday, May 10, 2017 9:04:12 AM
    > To: [email protected]
    > Subject: Re: querying from multiple directories in S3
    >
    > I don't think so, because doesn't AND commute, which would mean dir2 = 15
    > AND dir2=14 would always be false?
    >
    > Even if there is some comparison that works, isn't there still an issue
    > that the S3 file source has to scan all the files on every query?
    >
    > Wes
    >
    > On Wed, May 10, 2017 at 8:15 AM, Charles Givre <[email protected]> wrote:
    >
    > > Hi Wes,
    > > Are you putting the dirX fields in the WHERE clause?
    > > I.e., couldn't you do something like:
    > >
    > > SELECT  <fields>
    > > FROM s3.data
    > > WHERE (dir2 = 15 AND dir3 < 20) AND (dir2 = 14 AND dir3 > 4)
    > >
    > > In theory this could work for UTC -4.  It’s ugly… but I think it would
    > > work.
    > > — C
    > >
    > >
    > >
    > > > On May 9, 2017, at 10:06, Wesley Chow <[email protected]> wrote:
    > > >
    > > > What is the recommended way to issue a query against a large number of
    > > > tables in S3? At the moment I'm aliasing the table as a giant UNION ALL,
    > > > but is there a better way to do this?
    > > >
    > > > Our data is stored as a time hierarchy, like YYYY/MM/DD/HH/MM in UTC, but
    > > > unfortunately I can't simply run the query recursively on an entire day of
    > > > data. I usually need a day of data in a non-UTC time zone. Is there some
    > > > elegant way to grab that data using the dir0, dir1 magic columns?
    > > >
    > > > Thanks,
    > > > Wes
    > >
    > >
    >
