Re: Querying parquet files

Yousef Lasi Tue, 07 Jul 2015 06:38:46 -0700

Thanks Ted. That fits with my understanding of columnar data stores. I'm trying 
to get a handle on how Drill deals with parquet. Am I correct in assuming that 
it will allocate a thread for each core available across all Drillbits and read 
that many columns in parallel? So if we have 48 cores available and the file 
has 48 columns, should the time to query a single column roughly equal the 
time to query all 48 columns? All other factors, such as data types, being the 
same, of course.



July 7 2015 2:14 AM, "Ted Dunning" <[email protected]> wrote: 
> How many columns do you have?
> 
> Do you understand how columnar data stores work, and how selecting only a
> single column means that much less data needs to be read? If your data
> consists, say, of 4-byte integers, then Drill only needs to read about 160MB
> (40 million values x 4 bytes) to satisfy your query, which can quite
> reasonably be read in a second or less.
> 
> If your records are much wider than that (say 50 columns or so), then
> reading * could easily take a minute, especially if you don't have disk
> bandwidth to read that much data in parallel.
> 
> On Mon, Jul 6, 2015 at 7:11 PM, Yousef Lasi <[email protected]> wrote:
> 
>> I'm hoping someone can expand my understanding of the mechanics of a query
>> against a parquet file. We're finding that selecting a single column in a
>> record from a file with > 40 million records is extremely fast - typically
>> less than a second. However, running a 'select *' query against the same
>> record using the same criteria is somewhat slow - as in greater than 60
>> seconds.
>> 
>> This might be expected behavior, but hopefully a better understanding of
>> why this occurs might help us optimize the structure of our data files
>> better as we create them.
>> 
>> Thanks
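
The size estimate in Ted's reply can be checked with a quick back-of-the-envelope
calculation. The sketch below is only illustrative: it assumes uncompressed 4-byte
integer values in every column and a hypothetical sustained read bandwidth of
150 MB/s, neither of which comes from the thread. It ignores Parquet encoding and
compression and says nothing about how Drill actually schedules its reads.

```python
# Rough scan-size estimate for a columnar file.
# Assumptions (not from the thread): uncompressed 4-byte values in every
# column and a hypothetical sustained read bandwidth of 150 MB/s.
ROWS = 40_000_000            # "> 40 million records", per the original question
BYTES_PER_VALUE = 4          # assumed 32-bit integers
BANDWIDTH = 150_000_000      # hypothetical bytes per second


def scan_bytes(rows, columns, bytes_per_value=BYTES_PER_VALUE):
    """Bytes that must be read to scan `columns` columns of `rows` rows."""
    return rows * columns * bytes_per_value


def scan_seconds(total_bytes, bandwidth=BANDWIDTH):
    """Time to read `total_bytes` when limited only by read bandwidth."""
    return total_bytes / bandwidth


one_column = scan_bytes(ROWS, 1)      # 160,000,000 bytes = 160 MB
fifty_columns = scan_bytes(ROWS, 50)  # 8,000,000,000 bytes = 8 GB

print(f"1 column:   {one_column / 1e6:.0f} MB, ~{scan_seconds(one_column):.1f} s")
print(f"50 columns: {fifty_columns / 1e9:.0f} GB, ~{scan_seconds(fifty_columns):.1f} s")
```

Under these assumptions the bytes read grow linearly with the number of columns
selected, so a 50-column scan reads 50 times as much data as a single-column
scan. Adding reader threads only shortens the wall-clock time if the storage can
actually deliver that much data in parallel.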
