You might also want to check out the new partitioned Parquet creation that was launched with 1.1.0: https://drill.apache.org/docs/partition-by-clause/
This would increase your read speed if your queries tend to use predicates.

Chris Matta
[email protected]
215-701-3146

On Tue, Jul 7, 2015 at 9:38 AM, Yousef Lasi <[email protected]> wrote:

> Thanks Ted. That fits with my understanding of columnar data stores. I'm
> trying to get a handle on how Drill deals with Parquet. Am I correct in
> assuming that it will allocate a thread for each core available to all
> drillbits and read x number of columns in parallel? So if we have 48 cores
> available and the file has 48 columns, then the time for the query for a
> single column should roughly equal the time for 48 columns? All other
> factors, such as data types, being the same, of course.
>
> On July 7, 2015 at 2:14 AM, "Ted Dunning" <[email protected]> wrote:
>
> > How many columns do you have?
> >
> > Do you understand about columnar data stores and how selecting only a
> > single column means that much less data needs to be read? If your data
> > consists, say, of integers, then Drill only needs to read 160 MB to
> > satisfy your query, which is quite reasonable to read in a second or
> > less.
> >
> > If your records are much wider than that (say 50 columns or so), then
> > reading * could easily take a minute, especially if you don't have the
> > disk bandwidth to read that much data in parallel.
> >
> > On Mon, Jul 6, 2015 at 7:11 PM, Yousef Lasi <[email protected]> wrote:
> >
> > > I'm hoping someone can expand my understanding of the mechanics of a
> > > query against a Parquet file. We're finding that selecting a single
> > > column in a record from a file with more than 40 million records is
> > > extremely fast - typically less than a second. However, running a
> > > "select *" query against the same record using the same criteria is
> > > somewhat slow - as in greater than 60 seconds.
> > >
> > > This might be expected behavior, but hopefully a better understanding
> > > of why this occurs might help us better optimize the structure of our
> > > data files as we create them.
> > >
> > > Thanks
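
For reference, a minimal sketch of the CTAS + PARTITION BY syntax Chris is
pointing at, assuming a hypothetical sales dataset with a sale_year column
(all table, path, and column names here are made up for illustration):

    -- Write partitioned Parquet with CTAS (Drill 1.1.0+).
    -- The partition column must appear in the SELECT list.
    CREATE TABLE dfs.tmp.`sales_by_year`
    PARTITION BY (sale_year)
    AS SELECT sale_year, customer_id, amount
    FROM dfs.`/data/sales.parquet`;

    -- A predicate on the partition column can then prune whole files
    -- instead of scanning the entire dataset:
    SELECT customer_id, amount
    FROM dfs.tmp.`sales_by_year`
    WHERE sale_year = 2014;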

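To make Ted's arithmetic concrete: at roughly 40 million rows, one 4-byte
integer column is about 40,000,000 x 4 bytes = 160 MB, while 50 such columns
come to about 8 GB, so a "select *" reads on the order of 50x the data of a
single-column query. A sketch of the two access patterns against a
hypothetical file (path and column names are illustrative):

    -- Column-pruned scan: only the row_id column chunks are read (~160 MB).
    SELECT row_id
    FROM dfs.`/data/wide_table.parquet`
    WHERE row_id = 12345;

    -- Full scan: every column chunk is read (~8 GB for 50 integer columns),
    -- which accounts for the sub-second vs. 60-second gap seen above.
    SELECT *
    FROM dfs.`/data/wide_table.parquet`
    WHERE row_id = 12345;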