You might also want to check out the new partitioned Parquet creation that was launched with 1.1.0: https://drill.apache.org/docs/partition-by-clause/
This would increase your read speed if your queries tend to use predicates.

Chris Matta
[email protected]
215-701-3146

On Tue, Jul 7, 2015 at 9:38 AM, Yousef Lasi <[email protected]> wrote:

> Thanks Ted. That fits with my understanding of columnar data stores. I'm
> trying to get a handle on how Drill deals with Parquet. Am I correct in
> assuming that it will allocate a thread for each core available to all
> drillbits and read x number of columns in parallel? So if we have 48 cores
> available and the file has 48 columns, then the time for the query for a
> single column should roughly equal the time for 48 columns? All other
> factors, such as data types, being the same, of course.
>
> On July 7, 2015 at 2:14 AM, "Ted Dunning" <[email protected]> wrote:
>
> > How many columns do you have?
> >
> > Do you understand about columnar data stores and how selecting only a
> > single column means that much less data needs to be read? If your data
> > consists, say, of integers, then Drill only needs to read 160 MB to
> > satisfy your query, which is quite reasonable to read in a second or
> > less.
> >
> > If your records are much wider than that (say 50 columns or so), then
> > reading * could easily take a minute, especially if you don't have the
> > disk bandwidth to read that much data in parallel.
> >
> > On Mon, Jul 6, 2015 at 7:11 PM, Yousef Lasi <[email protected]> wrote:
> >
> > > I'm hoping someone can expand my understanding of the mechanics of a
> > > query against a Parquet file. We're finding that selecting a single
> > > column in a record from a file with more than 40 million records is
> > > extremely fast - typically less than a second. However, running a
> > > "select *" query against the same record using the same criteria is
> > > somewhat slow - as in greater than 60 seconds.
> > >
> > > This might be expected behavior, but hopefully a better understanding
> > > of why this occurs might help us better optimize the structure of our
> > > data files as we create them.
> > >
> > > Thanks
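
For reference, a minimal sketch of the CTAS + PARTITION BY syntax Chris is
pointing at, assuming a hypothetical sales dataset with a sale_year column
(all table, path, and column names here are made up for illustration):

    -- Write partitioned Parquet with CTAS (Drill 1.1.0+).
    -- The partition column must appear in the SELECT list.
    CREATE TABLE dfs.tmp.`sales_by_year`
    PARTITION BY (sale_year)
    AS SELECT sale_year, customer_id, amount
    FROM dfs.`/data/sales.parquet`;

    -- A predicate on the partition column can then prune whole files
    -- instead of scanning the entire dataset:
    SELECT customer_id, amount
    FROM dfs.tmp.`sales_by_year`
    WHERE sale_year = 2014;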

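To make Ted's arithmetic concrete: at roughly 40 million rows, one 4-byte
integer column is about 40,000,000 x 4 bytes = 160 MB, while 50 such columns
come to about 8 GB, so a "select *" reads on the order of 50x the data of a
single-column query. A sketch of the two access patterns against a
hypothetical file (path and column names are illustrative):

    -- Column-pruned scan: only the row_id column chunks are read (~160 MB).
    SELECT row_id
    FROM dfs.`/data/wide_table.parquet`
    WHERE row_id = 12345;

    -- Full scan: every column chunk is read (~8 GB for 50 integer columns),
    -- which accounts for the sub-second vs. 60-second gap seen above.
    SELECT *
    FROM dfs.`/data/wide_table.parquet`
    WHERE row_id = 12345;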