That is neat, thanks!

Wes

On Wed, May 10, 2017 at 1:47 PM, Zelaine Fong <[email protected]> wrote:

> Drill will only scan the files in the 2017/03 directory.  See
> https://drill.apache.org/docs/how-to-partition-data/, which describes an
> example very similar to your use case.
>
> -- Zelaine
>
> On 5/10/17, 10:32 AM, "Wesley Chow" <[email protected]> wrote:
>
>     Suppose that I have a directory structure in S3 like so:
>
>     root/YYYY/MM/{lots of files}
>
>     Where YYYY and MM are year and month numbers. If I run a query like:
>
>     SELECT count(1) FROM root WHERE dir0='2017' AND dir1='03';
>
>     Does Drill do a scan to find all files in root, thus picking up files
> from
>     2016, and then filter them down to ones matching dir0='2017' and
> dir1='03'
>     before reading the data? That's what I meant by "scan all the files."
> Or
>     does Drill know that it only has to do a scan of files in the 2017/03
>     directory?
>
>     Wes
>
>
>     On Wed, May 10, 2017 at 12:15 PM, Chunhui Shi <[email protected]> wrote:
>
>     > I think what Charles meant was "WHERE (dir2 = 15 AND dir3 < 20) OR
> (dir2 =
>     > 14 AND dir3 > 4)",  and of course you need to add dir0 and dir1 for
> year
>     > and month.
>     >
>     >
>     > And what do you mean by "scan all the files on every query", scan
> all the
>     > files of one day data, I thought this was your purpose?
>     >
>     > ________________________________
>     > From: Wesley Chow <[email protected]>
>     > Sent: Wednesday, May 10, 2017 9:04:12 AM
>     > To: [email protected]
>     > Subject: Re: querying from multiple directories in S3
>     >
>     > I don't think so, because doesn't AND commute, which would mean dir2
> = 15
>     > AND dir2=14 would always be false?
>     >
>     > Even if there is some comparison that works, isn't there still an
> issue
>     > that the S3 file source has to scan all the files on every query?
>     >
>     > Wes
>     >
>     > On Wed, May 10, 2017 at 8:15 AM, Charles Givre <[email protected]>
> wrote:
>     >
>     > > Hi Wes,
>     > > Are you putting the dirX fields in the WHERE clause?
>     > > IE  Couldn't you do soemthing like:
>     > >
>     > > SELECT  <fields>
>     > > FROM s3.data
>     > > WHERE (dir2 = 15 AND dir3 < 20) AND (dir2 = 14 AND dir3 > 4)
>     > >
>     > > In theory this could work for UTC -4.  It’s ugly… but I think it
> would
>     > > work.
>     > > — C
>     > >
>     > >
>     > >
>     > > > On May 9, 2017, at 10:06, Wesley Chow <[email protected]> wrote:
>     > > >
>     > > > What is the recommended way to issue a query against a large
> number of
>     > > > tables in S3? At the moment I'm aliasing the table as a giant
> UNION
>     > ALL,
>     > > > but is there a better way to do this?
>     > > >
>     > > > Our data is stored as a time hierarchy, like YYYY/MM/DD/HH/MM in
> UTC,
>     > but
>     > > > unfortunately I can't simply run the query recursively on an
> entire day
>     > > of
>     > > > data. I usually need a day of data in a non-UTC time zone. Is
> there
>     > some
>     > > > elegant way to grab that data using the dir0, dir1 magic columns?
>     > > >
>     > > > Thanks,
>     > > > Wes
>     > >
>     > >
>     >
>
>
>

Reply via email to