Re: Querying parquet files

Ted Dunning Mon, 06 Jul 2015 23:14:29 -0700

How many columns do you have?

Do you understand how columnar data stores work, and how selecting only a
single column means that much less data needs to be read?  If your data
consists, say, of 4-byte integers, then for 40 million records Drill only
needs to read about 160MB to satisfy your query, which can quite reasonably
be read in a second or less.

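If your records are much wider than that (say 50 columns or so), then
select * has to read roughly 50 times as much data (on the order of 8GB if
every column is integer-sized) and could easily take a minute, especially
if you don't have the disk bandwidth to read that much data in parallel.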
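
As a rough sketch of the arithmetic above, the snippet below estimates how
many bytes a scan has to touch. It assumes every column holds a 4-byte
integer and ignores Parquet encoding and compression, so the figures are
only ballpark numbers; the function name scan_bytes and the 50-column
width are illustrative assumptions, not anything Drill provides.

```python
# Back-of-the-envelope estimate of the uncompressed bytes a columnar scan reads.
# Assumptions (illustrative only): every column is a 4-byte integer, and
# Parquet encoding/compression is ignored.

def scan_bytes(num_rows: int, num_columns: int, bytes_per_value: int = 4) -> int:
    """Return the number of bytes read when scanning num_columns columns."""
    return num_rows * num_columns * bytes_per_value

ROWS = 40_000_000  # record count mentioned in the original question

one_column = scan_bytes(ROWS, 1)      # select a single column
fifty_columns = scan_bytes(ROWS, 50)  # select * on a hypothetical 50-column file

print(f"1 column:   {one_column / 1e6:.0f} MB")
print(f"50 columns: {fifty_columns / 1e9:.0f} GB")
```

The ratio between the two figures is just the column count, which is why
a single-column query can finish in under a second while select * against
the same rows takes far longer.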

On Mon, Jul 6, 2015 at 7:11 PM, Yousef Lasi <[email protected]> wrote:

> I'm hoping someone can expand my understanding of the mechanics of a query
> against a parquet file. We're finding that selecting a single column in a
> record from a file with > 40 million records is extremely fast - typically
> less than a second. However, running a 'select *' query against the same
> record using the same criteria is somewhat slow - as in greater than 60
> seconds.
>
>  This might be expected behavior, but hopefully a better understanding of
> why this occurs might help us optimize the structure of our data files
> better as we create them.
>
>  Thanks
>
